[rsyslog] audit-grade queue / batching

david at lang.hm david at lang.hm
Fri May 15 01:52:52 CEST 2009


On Thu, 14 May 2009, Rainer Gerhards wrote:

> Hi all, and especially David,
>
> contrary to what I intended, I have gone back to the text processor to sketch
> down what I intend to do. It turned out that looking at the code and doing
> small micro-modifications was counter-productive without a "big picture" (a
> sign of the magnitude of the change... ;)).
>
> So I have updated the design document, especially added section 4.6 (still,
> it contains a lot of not-yet-updated information!).
>
> http://www.rsyslog.com/download/design.pdf
>
> One important design decision is documented at the end of 4.6.2 and I would
> appreciate feedback on it. Note that even 4.6.2 is not yet fully completed,
> but what is there is consistent. It will probably take me at least another
> day to get the full intended design for the ultra-reliable queue into the
> document (and make my mind clear about it). But, again, what is in the
> document is consistent, "just" some algorithms are missing. So please do not
> wonder that not everything is described.

I'm looking through it now. the definition of 'audit-grade' looks
reasonable. when using the example of the disk system, I would probably
have said that it should be redundant (usually implemented as a RAID
array). then, in the area where you emphasize that all components need to
be audit-grade, you can mention a RAID card without a battery-backed
cache as an easily overlooked item that would undermine the entire
system.



in section 3.1, what do you mean by "Further, no context switch will happen
between calls to doAction() and endTransaction()"? the term 'context
switch' has many meanings; which one do you intend here?




4.1.1: if doAction() will implicitly call beginTransaction() when it hasn't
been called yet, I still don't see the value in exposing
beginTransaction() explicitly. what advantage is there to ever calling it
rather than just letting the first doAction() call it?

how much value is there in allowing the doAction() call to not commit the
current message being submitted? it adds significantly to the
complications in dealing with status. allowing this does let the output
module run closer to the end of the buffer (without this ability
doAction() would have to consider its buffer full if there is not enough
space for a max-size message after adding the current message), but is
this worth the complication?

doing this would simplify the pseudocode for prepareAction() to be

def prepareAction():
   if state == rtry:
     try recovery (adjust state accordingly)
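
to make the buffer rule above concrete (treat the buffer as full once
there is no room left for another max-size message), here is a rough
Python sketch. the class, the sizes and the "committed"/"deferred" return
values are all made up for illustration, they are not taken from the
design document or from the rsyslog code.

```python
MAX_MSG_SIZE = 8     # hypothetical maximum message size, in bytes
BUFFER_SIZE = 20     # hypothetical output buffer size, in bytes


class OutputModule:
    """Toy output module whose do_action() always accepts the message."""

    def __init__(self):
        self.buffer = []       # messages accepted but not yet committed
        self.used = 0          # bytes currently held in the buffer
        self.committed = []    # stands in for the real destination

    def commit(self):
        # Flush everything buffered so far, including the current message.
        self.committed.extend(self.buffer)
        self.buffer = []
        self.used = 0

    def do_action(self, msg):
        # The current message always fits, because a call never returns
        # with less than MAX_MSG_SIZE bytes free.
        self.buffer.append(msg)
        self.used += len(msg)
        if BUFFER_SIZE - self.used < MAX_MSG_SIZE:
            # No room for another max-size message: consider the buffer
            # full and commit now, current message included.
            self.commit()
            return "committed"
        return "deferred"
```

the cost is visible in the sketch: up to MAX_MSG_SIZE - 1 bytes at the
end of the buffer go unused, but a commit never has to report a status
that excludes the message just submitted.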




I don't see how the return codes indicate an auto-commit _with_ m sub n 
(earlier you say the auto-commit may happen with or without the current 
message)


in section 4.3.2 you say (probably an English translation issue)

"in spite of the two different failure cases, different handling is needed 
for them"

this doesn't make sense to me, if they are different failure cases, I 
would start off assuming that different handling would be needed. I would 
have said something like "because there are two different categories of 
failure causes, different handling is needed for them"


4.6.2, looking at the to-delete list, my reaction is OUCH!! (there is
definitely a need for a non-sequential disk queue, but we don't want to go
there now)

in the meantime, since it is allowable to duplicate messages in case of
power failure, the to-delete list can be stored in RAM (as opposed to
needing to be stored on disk)

as an alternative, instead of trying to store batch IDs (with all the
problems that you accurately describe), could you use queue slot numbers?
instead of saying 'delete batch X' you just say 'delete slots 4-7'

or am I missing something here?

although since that can't work for sequential queues and you have to do a 
to-delete list for those, you are probably right that the right answer for 
now is to do the to-delete queue and allow future queue store definitions 
to replace that call.
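
for a queue store that can address slots directly, the slot-number idea
could be sketched like this. SlotQueue and its methods are hypothetical
names for illustration only, and the to-delete list lives purely in RAM,
relying on the point above that duplicates after a power failure are
acceptable.

```python
class SlotQueue:
    """Toy random-access queue store addressed by slot number."""

    def __init__(self, messages):
        self.slots = dict(enumerate(messages))   # slot number -> message
        self.to_delete = []    # in-RAM list of (first, last) slot ranges

    def mark_for_delete(self, first, last):
        # Record 'delete slots first-last' instead of a batch ID.
        self.to_delete.append((first, last))

    def process_deletes(self):
        # Apply the pending deletions. If power fails before this runs,
        # the list is lost and those messages get processed again:
        # duplicates, but no message loss.
        for first, last in self.to_delete:
            for slot in range(first, last + 1):
                self.slots.pop(slot, None)
        self.to_delete.clear()
```

usage would be along the lines of q.mark_for_delete(4, 7) once the batch
is committed, followed by q.process_deletes() whenever the queue gets
around to it.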



one final thought: if a message gets a permanent failure and backup
processing kicks in, can this still be audit-grade?

I'm thinking along the lines of inserting into a database, with a backup 
of writing the bad message to a local file.
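
to illustrate the scenario, here is a small Python sketch using sqlite3
as a stand-in database. the table layout, the length limit that makes a
message "bad", and the function names are all assumptions for the
example. it also assumes that the backup write has to be forced to disk
before success is reported, which seems like the minimum needed for the
backup path to have any chance of being audit-grade.

```python
import os
import sqlite3


def open_log_db():
    # Hypothetical schema: messages longer than 10 characters are
    # rejected, which stands in for a permanent failure.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (msg TEXT CHECK (length(msg) <= 10))")
    return conn


def store_with_backup(conn, msg, backup_path):
    """Try the database first; on a permanent failure write to a file."""
    try:
        with conn:   # commits on success, rolls back on an exception
            conn.execute("INSERT INTO log (msg) VALUES (?)", (msg,))
        return "database"
    except sqlite3.IntegrityError:
        # Retrying the same message would fail again, so hand it to the
        # backup action instead of retrying.
        with open(backup_path, "a", encoding="utf-8") as f:
            f.write(msg + "\n")
            f.flush()
            os.fsync(f.fileno())   # on disk before we report success
        return "backup"
```

whether the result counts as audit-grade is exactly the open question;
the sketch only shows that the message ends up in one of the two places.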


overall this looks very good to me.

David Lang


