[rsyslog] output plugin calling interface
Rainer Gerhards
rgerhards at hq.adiscon.com
Fri May 8 10:04:22 CEST 2009
> >> one suggestion I would make is that since message based failures
> >> cannot be reliably detected, I would consider using the same failure
> >> process for all failures, and declare a message as bad if it fails
> >> the max retry number of times by itself (once you hit n=1)
> >
> > But then you either
> >
> > A) do not need the batch logic at all (because the action is configured
> > for infinite retries)
> >
> > Or
> >
> > B) you lose many messages if the action is not configured for infinite
> > retries and you have a longer-duration outage e.g. on a database server.
> > Let's say it is offline for a couple of hours, then you lose almost
> > everything in that period
> >
> > To prevent this, you need two different retry methods.
>
> good point.
>
> the problem is trying to figure out which type of failure you have.
I agree, but we face this problem in any case. For example, you can consider
the v3 engine to be using A) logic. That, by the way, was why it took me so
long to understand the other use case you validly described. I didn't see how
the retry handling could make a difference because the end result seemingly
was the same (but not so if you have two different failure scenarios and do
different handling). The moral of the story, I think, is that we must try
to differentiate between the two.
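To illustrate what "two different retry methods" could look like, here is a
minimal sketch in C. All names (result_t, step_t, next_step) are hypothetical
and not rsyslog's real interface. It assumes the output module can report
whether a failure was action-caused or message-caused, and that a
message-caused failure in a batch is narrowed down by splitting the batch
until n=1, as suggested in the quoted text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes an output module might report (assumption). */
typedef enum { RES_OK, RES_ACTION_FAILED, RES_MSG_FAILED } result_t;

/* Hypothetical follow-up steps for the queue engine (assumption). */
typedef enum {
    STEP_COMMIT,              /* batch done, remove it from the queue */
    STEP_SUSPEND_RETRY_BATCH, /* destination is down: keep batch, retry later */
    STEP_SPLIT_BATCH,         /* narrow down which message is the bad one */
    STEP_RETRY_MSG,           /* single message, retries left */
    STEP_DISCARD_MSG          /* single message failed max times: declare bad */
} step_t;

step_t next_step(result_t res, size_t batch_len,
                 unsigned msg_tries, unsigned max_msg_tries)
{
    assert(batch_len > 0);
    switch (res) {
    case RES_OK:
        return STEP_COMMIT;
    case RES_ACTION_FAILED:
        /* Outage handling: never counts against individual messages. */
        return STEP_SUSPEND_RETRY_BATCH;
    case RES_MSG_FAILED:
        if (batch_len > 1)
            return STEP_SPLIT_BATCH;
        return (msg_tries < max_msg_tries) ? STEP_RETRY_MSG
                                           : STEP_DISCARD_MSG;
    }
    /* Unknown result: treat like an action-caused failure. */
    return STEP_SUSPEND_RETRY_BATCH;
}
```

The point of the split is that only the RES_MSG_FAILED path can ever discard
a message, so a database being offline for hours does not cost any data.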
> some failures can be identified by the output module as being data driven
> or infrastructure, but there are cases where it just can't tell
> (especially when talking to remote servers, database, relp, etc)
>
> how should these be handled?
I think this mostly depends on the quality of the output module.
First of all, "mostly" implies that there may be some other cases, where it
really is impossible to differentiate between the two. In that case, I would
treat the issue as an action-caused failure. There are two reasons for this:
1) rsyslog v3 currently always does this and not a single person has
complained about that so far. This is an empirical argument, and it does not
prove that no problems were caused. But it carries the connotation that this
seems not to be too bad.
2) If we treated it as a message-caused failure, we would no longer be able
to handle extended outages of destination systems, which I consider a vitally
important feature.
When weighing the two, I know of lots of people who rely on 2), in sharp
contrast to no person having problems with 1). So my conclusion is that it is
less problematic to define an otherwise undefinable failure reason to be
action-caused. Even more so as I assume this problem only exists in the
minority of cases.
Now back to the quality of the output module: thinking about databases, their
API is usually very good at conveying back if there was a SQL error or a
connection abort. So while a SQL error may also be an indication of a
configuration problem, I would strongly tend to treat it as being
message-caused. This is under the assumption that any reasonably responsible
admin will hopefully test his configuration at least once before turning it
into production. And config SQL errors should manifest immediately, so I
expect these to be fixed before a configuration runs in production. So it is
the chore of the output module to interpret the return code it received from
its API and decide whether this is more likely action-caused or
message-caused. For database outputs, I would assume that it is always easy
to classify failures that can only be action-caused, especially in the
dominating case of a failed network connection or a failed server.
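As a rough sketch of that interpretation step for a database output: the
status codes below are hypothetical stand-ins for whatever a real database
client library reports, and classify_db_status is an invented name. The one
design decision taken from the discussion above is that a status the module
cannot interpret is treated as action-caused.

```c
#include <assert.h>

/* Hypothetical, simplified driver status codes (assumption, not a real API). */
typedef enum {
    DB_OK,
    DB_SQL_ERROR,   /* statement rejected by the server */
    DB_CONN_LOST,   /* connection aborted */
    DB_SERVER_DOWN, /* could not reach the server at all */
    DB_UNKNOWN      /* driver could not tell what went wrong */
} db_status_t;

typedef enum { CAUSE_NONE, CAUSE_ACTION, CAUSE_MESSAGE } cause_t;

cause_t classify_db_status(db_status_t st)
{
    assert((int)st >= 0 && (int)st <= (int)DB_UNKNOWN);
    switch (st) {
    case DB_OK:
        return CAUSE_NONE;
    case DB_SQL_ERROR:
        /* Config errors are assumed to be caught in testing, so a SQL
         * error in production most likely comes from the message data. */
        return CAUSE_MESSAGE;
    case DB_CONN_LOST:
    case DB_SERVER_DOWN:
        return CAUSE_ACTION;
    case DB_UNKNOWN:
        /* Undefinable reason: default to action-caused so that long
         * outages do not lead to discarded messages. */
        return CAUSE_ACTION;
    }
    return CAUSE_ACTION;
}
```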
For other outputs it may not be as easy. But, for example, all stream network
outputs can detect a broken connection, so this also is a sure fit.
For dynafiles, it really depends on how hard one tries to differentiate
between the two cases. But I think you can go to great lengths here, too.
Especially if you do not only look at the creat() return code, but, iff a
failure occurs, you do more API calls to find out the cause.
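A sketch of what "more API calls" might mean for a dynafile: look at errno
first and, where errno alone is ambiguous, stat() the configured base
directory. The function names, the idea of a separate base directory, and the
grouping of errno values are assumptions for illustration only; a real module
would need to tune this to its own file name templates.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/stat.h>

typedef enum { OPEN_ACTION_CAUSED, OPEN_MESSAGE_CAUSED } open_cause_t;

/* Returns nonzero if path exists and is a directory. */
static int dir_exists(const char *path)
{
    struct stat sb;
    return stat(path, &sb) == 0 && S_ISDIR(sb.st_mode);
}

/* err is the errno left by a failed creat()/open(); base_dir is the
 * (hypothetical) fixed part of the path that does not depend on the message. */
open_cause_t classify_open_failure(int err, const char *base_dir)
{
    assert(base_dir != NULL);
    switch (err) {
    case ENOSPC: case EDQUOT: case EIO: case EROFS:
    case EMFILE: case ENFILE: case ENOMEM:
        /* Disk full, quota, I/O error, out of descriptors: infrastructure. */
        return OPEN_ACTION_CAUSED;
    case ENAMETOOLONG: case EINVAL: case EISDIR:
        /* The generated file name itself is unusable. */
        return OPEN_MESSAGE_CAUSED;
    case ENOENT: case ENOTDIR:
        /* Ambiguous: ask the file system one more question. If even the
         * fixed base directory is gone (e.g. a missing mount), blame the
         * action; otherwise the message-derived part of the name is bad. */
        return dir_exists(base_dir) ? OPEN_MESSAGE_CAUSED
                                    : OPEN_ACTION_CAUSED;
    default:
        /* Still unclear: fall back to action-caused. */
        return OPEN_ACTION_CAUSED;
    }
}
```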
So I think the remaining problem is small enough not to cause too many issues
(and if so, they are unavoidable in any case).
Rainer