[rsyslog] output plugin calling interface

david at lang.hm
Fri May 8 10:20:28 CEST 2009


On Fri, 8 May 2009, Rainer Gerhards wrote:

>>>> one suggestion I would make is that since message-based failures
>>>> cannot be reliably detected, I would consider using the same failure
>>>> process for all failures, and declare a message as bad if it fails
>>>> the max retry number of times by itself (once you hit n=1)
>>>
>>> But then you either
>>>
>>> A) do not need the batch logic at all (because the action is configured
>>> for infinite retries)
>>>
>>> Or
>>>
>>> B) you lose many messages if the action is not configured for infinite
>>> retries and you have a longer-duration outage e.g. on a database server.
>>> Let's say it is offline for a couple of hours, then you lose almost
>>> everything in that period
>>>
>>> To prevent this, you need two different retry methods.
>>
>> good point.
>>
>> the problem is trying to figure out which type of failure you have.
>
> I agree, but we face this problem in any case. For example, you can consider
> the v3 engine to be using A) logic. That, by the way, was why it took me so
> long to understand the other use case you validly described. I didn't see how
> the retry handling could make a difference because the end result seemingly
> was the same (but not so if you have two different failure scenarios and do
> different handling). The moral of the story, I think, is that we must try
> to differentiate between the two.
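
A minimal sketch of the two retry methods discussed above, written in Python purely for illustration. It is not rsyslog's implementation: `process_batch`, `submit`, `max_action_retries` and `wait` are hypothetical names, and the sketch assumes the output module can report which kind of failure it saw. Action-caused failures retry the same batch (forever if `max_action_retries` is None); message-caused failures halve the batch until the bad message is isolated at n=1.

```python
OK = "ok"
ACTION_CAUSED = "action-caused"    # destination unavailable, retry later
MESSAGE_CAUSED = "message-caused"  # this data will never be accepted


def process_batch(batch, submit, max_action_retries=None, wait=lambda: None):
    """Deliver a batch; return (delivered, discarded).

    submit(chunk) is a hypothetical callback that returns OK,
    ACTION_CAUSED or MESSAGE_CAUSED for the chunk as a whole.
    """
    delivered, discarded = [], []
    pending = [list(batch)] if batch else []
    while pending:
        chunk = pending.pop(0)
        attempts = 0
        while True:
            result = submit(chunk)
            if result != ACTION_CAUSED:
                break
            # Retry method 1: same chunk again, after waiting.
            attempts += 1
            if max_action_retries is not None and attempts > max_action_retries:
                break
            wait()
        if result == OK:
            delivered.extend(chunk)
        elif result == ACTION_CAUSED:
            # Retries exhausted during an outage: the whole chunk is lost.
            discarded.extend(chunk)
        elif len(chunk) == 1:
            # Retry method 2 has reached n=1: this message is the bad one.
            discarded.extend(chunk)
        else:
            # Retry method 2: halve the chunk to isolate the bad message.
            mid = len(chunk) // 2
            pending[:0] = [chunk[:mid], chunk[mid:]]
    return delivered, discarded
```

With a finite `max_action_retries` and a long outage, every chunk ends up in `discarded`, which is case B) above; with None the loop simply keeps waiting, which is case A).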
>
>
>> some failures can be identified by the output module as being data driven
>> or infrastructure, but there are cases where it just can't tell
>> (especially when talking to remote servers, database, relp, etc)
>>
>> how should these be handled?
>
> I think this mostly depends on the quality of the output module.
>
> First of all, "mostly" implies that there may be some other cases, where it
> really is impossible to differentiate between the two. In that case, I would
> treat the issue as an action-caused failure. There are two reasons for this:
>
> 1) rsyslog v3 currently always does this, and not a single person has
> complained about it so far. This is an empirical argument, and it does not
> prove that it caused no problems. But it carries the connotation that this
> seems not to be too bad.
>
> 2) If we treated it as a message-caused failure, we would no longer be able
> to handle extended outages of destination systems, which I consider a vitally
> important feature.
>
> When weighing the two, I know of lots of people who rely on 2), in sharp
> contrast to no person having problems with 1). So my conclusion is that it is
> less problematic to define an otherwise undefinable failure reason to be
> action-caused. Even more so as I assume this problem only exists in the
> minority of cases.
>
> Now back to the quality of the output module: thinking about databases, their
> API is usually very good at conveying back if there was a SQL error or a
> connection abort. So while a SQL error may also be an indication of a
> configuration problem, I would strongly tend to treat it as being
> message-caused. This is under the assumption that any reasonably responsible
> admin will hopefully test his configuration at least once before turning it
> into production. And config SQL errors should manifest immediately, so I
> expect these to be fixed before a configuration runs in production. So it is
> the chore of the output module to interpret the return code it received from
> its API and decide whether this is more likely action-caused or
> message-caused. For database outputs, I would assume that it is always easy
> to classify failures that can only be action-caused, especially in the
> dominating case of a failed network connection or a failed server.
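
As a rough illustration of an output module interpreting what its database API reports, here is a sketch in Python. It borrows the exception classes that the DB-API (PEP 249) defines, taken from the standard `sqlite3` module only because it ships with Python. The mapping itself is an assumption made for illustration: real drivers differ in which class they raise for a given problem, and `classify_db_error` is a hypothetical helper, not part of rsyslog or of any driver.

```python
import sqlite3  # used only for its PEP 249 exception classes

MESSAGE_CAUSED = "message-caused"
ACTION_CAUSED = "action-caused"

# Assumed mapping: errors about the statement or its data are blamed on
# the message; everything else is blamed on the action.
_DATA_ERRORS = (
    sqlite3.DataError,
    sqlite3.IntegrityError,
    sqlite3.ProgrammingError,
)


def classify_db_error(exc):
    """Guess whether a database exception is message- or action-caused."""
    if isinstance(exc, _DATA_ERRORS):
        return MESSAGE_CAUSED
    # Connection aborts, server failures and anything that cannot be
    # identified fall back to action-caused, so that long outages stay
    # retryable.
    return ACTION_CAUSED
```

The fallback branch mirrors the reasoning above: a failure that cannot be classified is treated as action-caused.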
>
> For other outputs it may not be as easy. But, for example, all stream network
> outputs can detect a broken connection, so this also is a sure fit.
>
> For dynafiles, it really depends on how hard one tries to differentiate
> between the two cases. But I think you can go to great lengths here, too.
> Especially if you do not only look at the creat() return code but, if a
> failure occurs, make more API calls to find out the cause.
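
One way to picture the dynafile case, sketched in Python rather than in rsyslog's own C. The helper names and the errno mapping are assumptions for illustration only. The sketch assumes a configured base directory with a file name generated from the message; when the open fails with ENOENT, one additional call checks whether the base directory itself still exists.

```python
import errno
import os

MESSAGE_CAUSED = "message-caused"
ACTION_CAUSED = "action-caused"

# errno values assumed to point at the generated name, not at the system.
_NAME_ERRORS = {errno.ENAMETOOLONG, errno.EISDIR, errno.ENOTDIR, errno.EINVAL}


def classify_open_error(err_no, base_dir):
    """Classify an errno from opening a file below base_dir."""
    if err_no in _NAME_ERRORS:
        return MESSAGE_CAUSED
    if err_no == errno.ENOENT:
        # Extra API call: if the base directory exists, the missing path
        # component came from the message-derived part of the name.
        return MESSAGE_CAUSED if os.path.isdir(base_dir) else ACTION_CAUSED
    # Disk full, read-only file system, out of descriptors, I/O errors
    # and anything unknown are treated as action-caused.
    return ACTION_CAUSED


def open_dynafile(base_dir, name):
    """Return (fd, None) on success or (None, classification) on failure."""
    path = os.path.join(base_dir, name)
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o640)
        return fd, None
    except OSError as e:
        return None, classify_open_error(e.errno, base_dir)
```

Permission errors are left in the action-caused bucket here; telling a misconfigured directory from a hostile file name would need further checks.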
>
> So I think the remaining problem is small enough not to cause too many
> issues (and if it does, they are unavoidable in any case).

sounds reasonable.

David Lang


