[rsyslog] output plugin calling interface
david at lang.hm
Thu May 7 20:49:34 CEST 2009
On Thu, 7 May 2009, Rainer Gerhards wrote:
> I have now looked at the code and modified it, so I get some "feeling" of how
> it looks and works (it doesn't matter if I need to dump or modify that code,
> it took a few hours, less than writing other things ;)). So I am still "only"
> biased, but not without alternative. The code looks much cleaner than what is
> in v3, btw.
> But I think I have also come closer to where our opinions differ. I mentioned
> this morning that we have different design approaches. But that's not the
> full picture... let me quote you:
>>> Take care of the new state diagram and be sure to understand that it is
>>> an own *action state*, not a batch transaction state (that's different
>>> for tomorrow ;)).
>> I'm not sure that there can be an error from the inTx stage that would be
>> worth retrying. errors there would not be related to outputting the
>> message, but simply to processing it and preparing it to be output later.
>> in fact, I'm not sure that retry belongs in the message state at all. I
>> could see it argued that the commit may result in a temporary error that
>> could be retried, but is that really something that the action (i.e.
>> output module) should deal with? or should this be done at the transaction
>> in reading the page after the diagram, it appears that you are thinking
>> the same thing, in which case the retry and suspend nodes should be
>> removed from the state diagram (or there may need to be a suspend node if
>> you want the higher levels to be able to try again and the module to
>> reject it)
>> looking at your pseudocode, I started to re-write it, and I think things
>> can be much simpler.
>> if the retries are done above this level, then the only thing that we need
>> to do is to not hammer the destination.
>> except for the fact that doAction() can trigger an EndTransaction()
>> internally, there is no reason why doAction() can't take place while
>> suspended (the output module can be preparing the stuff to send out). the
>> only place that needs to deal with the issue is that the EndTransaction()
>> should sleep if the state is not itx
>> if doAction does beginTransaction() any time it's not in a transaction,
>> there is no reason to have it as a separate call.
>> so without retries or beginTransaction, is there any reason for
>> prepareAction() to exist?
>> you also will need to detect that the doAction() did endTransaction() and
>> that you don't need to issue an endTransaction() for this output module now
>> (until you do the next doAction() )
> I think we probably have different failure cases on our mind. We touched
> this, but probably did not make the issue clear enough. I now think that
> these different classes of failures require different handling, probably at
> different layers of the engine. Maybe this can help to combine our both
> I was first tempted to start the description right here in mail, but instead
> I have added some text to the "internals document", hoping that the
> information may be useful in the future, too (and knowing that I need to edit
> it soon ;)).
> Note that I have NOT yet updated any other part of the document. It's
> probably also affected by thinking about failure cases.
> So, I'd appreciate if you could have a look at sections 3.2 and 3.3 of
overall it looks good.
one suggestion I would make is that since message based failures cannot be
reliably detected, I would consider using the same failure process for all
failures, and declare a message as bad if it fails the max retry number of
times by itself (once you hit n=1)
otherwise you end up resubmitting the entire batch a number of times
before you try to narrow it down to the particular message. since the
process of finding the bad message will take a number of retries, and then
you will want to retry the suspect message several times (to make sure
that it's really a message error, not an action error) this could result in
a lot of retries.
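
to make the suggestion concrete, here is a minimal Python sketch of one way to read it. it is not rsyslog code: submit_batch, send and max_retries are hypothetical names, and send() is assumed to return True when the whole batch was accepted. a failed batch is split right away, and the retry budget is only spent once a message is on its own (n=1). pausing between attempts so the destination is not hammered is left out for brevity.

```python
from typing import Callable, List, Sequence


def submit_batch(
    batch: Sequence[str],
    send: Callable[[Sequence[str]], bool],
    max_retries: int = 3,
) -> List[str]:
    """Deliver batch via send(); return the messages declared bad.

    Hypothetical helper: send(batch) is assumed to return True only if
    every message in the batch was accepted by the destination.
    """
    if not batch:
        return []
    if len(batch) > 1:
        # One attempt for the whole batch; on failure, narrow it down
        # immediately instead of resubmitting the full batch repeatedly.
        if send(batch):
            return []
        mid = len(batch) // 2
        return (submit_batch(batch[:mid], send, max_retries)
                + submit_batch(batch[mid:], send, max_retries))
    # n == 1: the retry budget is spent only on a message by itself.
    for _ in range(max_retries):
        if send(batch):
            return []
    # Failed max_retries times on its own, so treat it as a bad message.
    return [batch[0]]


if __name__ == "__main__":
    def flaky_send(msgs: Sequence[str]) -> bool:
        # Toy destination that rejects any batch containing "bad".
        return "bad" not in msgs

    print(submit_batch(["a", "b", "bad", "c"], flaky_send))
```

the point of this shape is that the full batch is submitted only once before the search for the culprit starts, so the total number of submissions stays small even with a generous max_retries.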
also, the algorithm that you posted has a subtle difference from what I
had listed. yours is more straightforward and easier to understand (and
requires no global knowledge), but I think that mine is more efficient in the
rare failure case. there is a potential (very subtle) race condition in
this area that will need attention when we get down to lower level
discussion (no matter which algorithm is used)
I don't see this as critical (not even very important) as we
are talking high-level concepts at this point, but I wanted to note this
for a future conversation.
two notes on the reliability section
1. I think we had figured out that reliability required touching each item
3 times instead of 2 (not 4 times as you note in the text)
2. I disagree with you on the idea that power issues should be handled at
a different level. I'll try to track down some discussions on
sysadmin/security mailing lists about this.