[rsyslog] output plugin calling interface

Rainer Gerhards rgerhards at hq.adiscon.com
Fri May 1 21:56:18 CEST 2009


David, 

> -----Original Message-----
> From: rsyslog-bounces at lists.adiscon.com 
> [mailto:rsyslog-bounces at lists.adiscon.com] On Behalf Of david at lang.hm
> Sent: Thursday, April 30, 2009 11:57 PM
> To: rsyslog-users
> Subject: Re: [rsyslog] output plugin calling interface
> >> 1. something wrong with a message.
> >>     retries won't help (with batches we need to narrow down
> >> to which message)
> >
> > But what could be wrong? For which outputs?
> >
> > Can you envison anything other than db outputs only, 
> duplicate identity in
> > column errornously declared with a unique index (which I 
> still consider an
> > admin error)? I really fail to find any other sample.
> 
> things like quoting errors (which I know you defend against)
> 
> unicode text causing grief
> 
> dynafile hits a read-only file
> 
> basicly data-driven things that trigger bugs in the message delivery 
> mechanism in some form.
> 
> this is easiest to understand and provide examples with 
> databases, but you 
> could have problems with other methods.
> 
> in an ideal world these would never happen, but for most 
> output types I 
> can think of some form of corrupt input that could cause that 
> message to 
> fail.

OK, thanks a lot. It doesn't matter if I can defend a case or two, but you
provided good examples. I think I now can be relieved that we have sufficient
probability to seriously look at these cases.

Resting sometimes helps, and so I also think I made a big step forward (while
not actively working on it ;)) today. I think what makes the problem so hard
to solve is the language we use. I thought about a purely mathematical model,
and I have one on my mind. The state diagram was a first step, but it went
not far enough. So I think I will try to write up that model and then we can
discuss based on it and finally derive the actual code from it. That's an
extra step, but I think it will be a useful one.

As a side-note, I have also identified that we have overlooked a subtle issue
so far: backup actions - they need to work on the subset of the batch that
had message permanent failures. So the message state actually needs to be
part of the message inside the batch. But now, I think, things really begin
to come together and are far less complex than initially thought.

One problem with the state chart - that was why I said it is not 100% correct
- is that it does not properly abstract batches vs. single messages. Both of
them entangled in a way that I thought [;)] to be very complex. But if you
model that with processing states, then the batch processing state is simply
a function of the individual message processing states.

Please let me know if you also find a math model useful (but I'll probably
need to do it in any case, because it helps me clean up my mind...).

Rainer
> 
> >>
> >> 2. something wrong with the destination.
> >>     in this case waiting and retrying makes sense, the
> >> destination may
> >> recover (either on it's own, or through sysadmin assistance)
> >
> > Yes, potentially recoverable.
> >
> >>
> >> 3. internal rsyslog issue (out of memory or similar)
> >
> > That's potentially recoverable, too. The most concerning 
> issue is running out
> > of process address space, but even this can be recovered 
> from (at least in
> > theory, in practice I doubt that rsyslog really will survive such a
> > situation)
> 
> agreed.
> 
> >>
> >> I'm not thinking of things that don't fall in one of these
> >> buckets, but
> >> I'll add more later if I think of them.
> >
> > The bottom line is that I see only a single case, and that 
> is caused by
> > config error.
> 
> if you define failure to appropriatly handle odd characters 
> (failing to 
> change control characters for example) to be a config error I 
> can agree 
> with you.
> 
> David Lang
> 
> > Rainer
> >
> >>
> >>> Rainer
> >>> PS: tomorrow is a public holiday over here, I may be out
> >> with friends and
> >>> family and don't know if I will get to my mail.
> >>
> >> no problem.
> >>
> >> David Lang
> >>
> >>>> -----Original Message-----
> >>>> From: rsyslog-bounces at lists.adiscon.com
> >>>> [mailto:rsyslog-bounces at lists.adiscon.com] On Behalf Of
> >> david at lang.hm
> >>>> Sent: Thursday, April 30, 2009 9:42 PM
> >>>> To: rsyslog-users
> >>>> Subject: Re: [rsyslog] output plugin calling interface
> >>>>
> >>>> On Thu, 30 Apr 2009, Rainer Gerhards wrote:
> >>>>
> >>>>> I have just drawn a state diagram. It is not 100% perfect,
> >>>> but I think
> >>>>> it conveys the idea better than the text. Anyhow, you
> >> should read it
> >>>>> together with the text below, because the text is more
> >> precise. Note
> >>>>> that the diagram looks only at message states, thus
> >>>>> message-permanent-failure and action-permanent-failure lead
> >>>> to the some
> >>>>> state in the diagram:
> >>>>>
> >>>>> http://www.rsyslog.com/modules/Static_Docs/data/action-call.png
> >>>>>
> >>>>> I hope it is useful.
> >>>>>
> >>>>> I drew it with graphvic, the dot control file is in
> >>>> multi-dequeue git
> >>>>> branch and named action-call.dot - in case you'd like 
> to modify ;)
> >>>>
> >>>> this is definantly useful and I believe accuratly represents
> >>>> what you have
> >>>> described.
> >>>>
> >>>> what I was proposing would be slightly different
> >>>>
> >>>> from commit pending you would have a third possible result
> >>>>
> >>>> change the endTransaction _SUSPENDED to endTransaction
> >> _SUSPENDED, b=1
> >>>>
> >>>> add another line endTransaction _SUSPENDED, b>1 that changes
> >>>> b=b/2 and
> >>>> goes back to 'ready for processing'
> >>>>
> >>>> then on the 'retry succeeds' line change b=Batch_Size
> >>>>
> >>>> David Lang
> >>>>
> >>>>> Rainer
> >>>>>
> >>>>> On Thu, 2009-04-30 at 19:00 +0200, Rainer Gerhards wrote:
> >>>>>> I forgot to mention one other subtle, but important issue.
> >>>> I add it down
> >>>>>> below in the right context:
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
> >>>>>>> bounces at lists.adiscon.com] On Behalf Of Rainer Gerhards
> >>>>>>> Sent: Thursday, April 30, 2009 5:47 PM
> >>>>>>> To: rsyslog-users
> >>>>>>> Subject: Re: [rsyslog] output plugin calling interface
> >>>>>>>
> >>>>>>> David,
> >>>>>>>
> >>>>>>> one more note. I think I have basically solved the
> >>>> initial discussion topic
> >>>>>>> (with the *action* look and I am back at the overall
> >>>> picture). I will write
> >>>>>>> up on the action lock later, I think I can save some time
> >>>> when we go down
> >>>>>> the
> >>>>>>> the other topic first.
> >>>>>>>
> >>>>>>> Rest inline below...
> >>>>>>>
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
> >>>>>>>> bounces at lists.adiscon.com] On Behalf Of david at lang.hm
> >>>>>>>> Sent: Thursday, April 30, 2009 1:32 AM
> >>>>>>>> To: rsyslog-users
> >>>>>>>> Subject: Re: [rsyslog] output plugin calling interface
> >>>>>>>>
> >>>>>>>> On Wed, 29 Apr 2009, Rainer Gerhards wrote:
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> the key to the locking is to try to both minimize the work
> >>>>>>>>>> done inside the
> >>>>>>>>>> lock, and minimize the time the lock is held.
> >>>>>>>>>>
> >>>>>>>>>> in general I think the locking should be something
> >>>> along the lines of
> >>>>>>>>>>
> >>>>>>>>>> doStartbatch()
> >>>>>>>>>>    lock action mutex
> >>>>>>>>>>      multiple doAction() calls
> >>>>>>>>>>    unlock action mutex
> >>>>>>>>>> doEndbatch()
> >>>>>>>>>
> >>>>>>>>> I overlooked a very important point, and it now appears
> >>>> to me. I *must*
> >>>>>>> lock
> >>>>>>>>> the complete batch, including doEndBatch(). The reason
> >>>> is that a single
> >>>>>>>>> app-level transaction otherwise has no definite start
> >>>> and end point. So
> >>>>>>> we
> >>>>>>>>> would not know what is committed and what not (if
> >>>> another thread puts
> >>>>>>>>> messages into the transaction queue. That breaks the
> >>>> whole model. The
> >>>>>>> only
> >>>>>>>>> solution is to hold the lock during the whole
> >> tranaction. As you
> >>>>>> outline,
> >>>>>>>>> this should not be too long. Even if it took
> >>>> considerable time, there
> >>>>>>> would
> >>>>>>>>> be limited usefulness in interleaving other doAction
> >>>> calls, as this
> >>>>>> would
> >>>>>>>>> simply cause additional time. At this point, I think
> >>>> there is no real
> >>>>>>>> benefit
> >>>>>>>>> in running multiple threads concurrently.
> >>>>>>>>
> >>>>>>>> you really don't want to hold the lock through the
> >>>> doEndBatch() call,
> >>>>>> that
> >>>>>>>> can potentially take a _long_ time, and that is the time
> >>>> when you most
> >>>>>>>> want the ability to have other things accessing the
> >>>> queue (note that I
> >>>>>> may
> >>>>>>>> be misunderstanding the definition of the lock here)
> >>>>>>>>
> >>>>>>>> the output module will be tied up this entire time, but
> >>>> you need other
> >>>>>>>> things to be able to access the queue (definantly
> >>>> addding things to the
> >>>>>>>> queue, but there is a win in having the ability for
> >>>> another reader thread
> >>>>>>>> to be pushing things to a different copy of the output
> >>>> module at the same
> >>>>>>>> time)
> >>>>>>>>
> >>>>>>>> so this means that when you are goign through and doing
> >>>> the doAction()
> >>>>>>>> call, you are marking that you are working on that queue
> >>>> entry. then you
> >>>>>>>> release the lock and the next reader that comes along
> >>>> will skip over the
> >>>>>>>> entries that you have claimed and work to deliver the
> >>>> next N messages.
> >>>>>>>>
> >>>>>>>> then when you get the results of doEndBatch() you go
> >>>> back and mark some
> >>>>>> or
> >>>>>>>> all of those messages as completed (removing them from
> >>>> the queue). note
> >>>>>>>> that with multiple worker threads you have the potential
> >>>> to have items
> >>>>>>>> that aren't the oldest ones in the queue being completed
> >>>> before the
> >>>>>> oldest
> >>>>>>>> ones, batching may make bigger holes, but the potential
> >>>> for holes was
> >>>>>>>> there all along.
> >>>>>>>>
> >>>>>>>>>> and the output module should be written to defer as much
> >>>>>>>>>> processing as
> >>>>>>>>>> possible to the doEndbatch() call to make the 
> doAction() call
> >>>>>>>>>> as fast as
> >>>>>>>>>> possible.
> >>>>>>>>>
> >>>>>>>>> Sounds reasonable.
> >>>>>>>>
> >>>>>>>> note that the reason for doAction deferring as much work
> >>>> as possible is
> >>>>>> to
> >>>>>>>> allow that work to be done outside of any locking
> >>>>>>>>
> >>>>>>>>>> since most errors will not be detected during doAction (in
> >>>>>>>>>> fact, the only
> >>>>>>>>>> errors I can think of that will happen at this point are
> >>>>>>>>>> rsyslog resource
> >>>>>>>>>> contratints), the error handling will need to be done after
> >>>>>>>>>> doEndbatc()
> >>>>>>>>>> returns
> >>>>>>>>>>
> >>>>>>>>>> at that point the output module may not know which of the
> >>>>>>>>>> messages caused
> >>>>>>>>>> the error (if the module sends the messages as a 
> transaction
> >>>>>>>>>> it may just
> >>>>>>>>>> know that the transaction failed, and have to do 
> retries with
> >>>>>>>>>> subsets to
> >>>>>>>>>> narrow down which message caused the failure)
> >>>>>>>>>>
> >>>>>>>>>> as long as at least one message is sucessful, 
> things are not
> >>>>>>>>>> blocked and
> >>>>>>>>>> should continue. it's only when doEndBatch() reports
> >>>> that no messages
> >>>>>>>>>> could be delivered that you have a possible reason to drop
> >>>>>>>>>> the message
> >>>>>>>>>> (and even then, only the first message. all others
> >>>> must be retried)
> >>>>>>>>>
> >>>>>>>>> Well, I wouldn't conclude that it is the first
> >> message, but "one
> >>>>>> message"
> >>>>>>>>> inside the batch. So there may be some benefit in
> >>>> retrying the batch
> >>>>>> with
> >>>>>>>>> less records (as you suggested). Under the assumption
> >>>> that usually only
> >>>>>>> one
> >>>>>>>>> record casues the problem, I tend to think that it may
> >>>> be useful to run
> >>>>>>>>> commit the batch one-by-one in this case - this may be
> >>>> more efficient
> >>>>>>> than a
> >>>>>>>>> binary search for the failing record.
> >>>>>>>>
> >>>>>>>> note that it's not _quite_ a binary search (same basic
> >>>> concept though),
> >>>>>> as
> >>>>>>>> you submit a subset of them they either go through or
> >>>> you need to try a
> >>>>>>>> smaller batch
> >>>>>>>
> >>>>>>> Ack
> >>>>>>>
> >>>>>>>>
> >>>>>>>> with the individual submissions you are O(N) (on average
> >>>> you will have to
> >>>>>>>> commit 1/2 the batch individually before you hit the
> >> bad one, ~6
> >>>>>>>> transactions for a batch size of 10, 51 for a batch
> >> size of 100)
> >>>>>>>>
> >>>>>>>> with the 'binary search' approach you are O(log(N)) ( ~4
> >>>> transactions for
> >>>>>>>> a batch size of 10, ~7 for a batch size of 100)
> >>>>>>>>
> >>>>>>>> the worst case is probably where the last message is the
> >>>> one that has the
> >>>>>>>> problem.
> >>>>>>>>
> >>>>>>>> for the individual processing that is simple math (batch
> >>>> size of 100, you
> >>>>>>>> will fail the first one, then submit 99 sucessfully,
> >>>> then fail on the
> >>>>>> last
> >>>>>>>> one)
> >>>>>>>>
> >>>>>>>> for the binary search it's more complicated (this is
> >>>> assuming the batch
> >>>>>>>> size gets bumped up when it succeeds)
> >>>>>>>>
> >>>>>>>> assuming the batch completly fails (i.e. a database
> >>>> where the output
> >>>>>>>> module doesn't know which one caused it to fail)
> >>>>>>>>
> >>>>>>>> bad message is message 100 and there are >>100 messages
> >>>> in the queue
> >>>>>>>> fail 100
> >>>>>>>> succeed 50 (bad message is now message 50)
> >>>>>>>> fail 100
> >>>>>>>> fail 50
> >>>>>>>> succeed 25 (bad message is now message 25)
> >>>>>>>> fail 100
> >>>>>>>> fail 50
> >>>>>>>> fail 25
> >>>>>>>> suceed 12 (bad message is now message 13
> >>>>>>>> fail 100
> >>>>>>>> fail 50
> >>>>>>>> fail 25
> >>>>>>>> suceed 12 (bad message is now message 1
> >>>>>>>> fail 100
> >>>>>>>> fail 50
> >>>>>>>> fail 25
> >>>>>>>> fail 12
> >>>>>>>> fail 6
> >>>>>>>> fail 3
> >>>>>>>> fail 1
> >>>>>>>> retry 1
> >>>>>>>> .
> >>>>>>>> .
> >>>>>>>> message 1 is bad (20 transactions + retries)
> >>>>>>>>
> >>>>>>>> best case would be
> >>>>>>>>
> >>>>>>>> bad message is message 1 and there are >>100 messages in
> >>>> the queue
> >>>>>>>> fail 100
> >>>>>>>> fail 50
> >>>>>>>> fail 25
> >>>>>>>> fail 12
> >>>>>>>> fail 6
> >>>>>>>> fail 3
> >>>>>>>> fail 1
> >>>>>>>> retry 1
> >>>>>>>> .
> >>>>>>>> .
> >>>>>>>> message 1 is bad (7 transactions + retries)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> if the output module is able to commit a partial
> >>>> transaction, then the
> >>>>>>>> logic devolves to
> >>>>>>>>
> >>>>>>>> bad message is message 100 and there are >>100 messages
> >>>> in the queue
> >>>>>>>> submit 100, succeed 99 bad message is message 1
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> All in all, it looks like the algorithm needs to get
> >> a bit more
> >>>>>>> complicated
> >>>>>>>>> ;)
> >>>>>>>>
> >>>>>>>> unfortunantly, but not very much more complicated.
> >>>>>>>
> >>>>>>> We are actually working here on a very important piece of
> >>>> code. This is
> >>>>>> also
> >>>>>>> why I am thinking so hard on it. This algorithm, in hopefully
> >>>>>>> as-simple-as-possible (but not simpler than that ;))
> >>>> handles all the retry
> >>>>>>> logic, the output transaction, action resumption and
> >>>> recovery and is also
> >>>>>>> responsible that the upper-layer queue object is able to
> >>>> reliably persist
> >>>>>> not
> >>>>>>> fully processed batches of messages during a restart.
> >>>> This is why I (and
> >>>>>>> obviously you, too) put so much design effort into it.
> >>>>>>>
> >>>>>>> This is primarily a note for those list followers that
> >>>> wonder why we work
> >>>>>> so
> >>>>>>> hard on these few lines of code.
> >>>>>>>>
> >>>>>>>> the algorithm I posted last week cut the batch size in
> >>>> half for each loop
> >>>>>>>> and restored it when a commit succeded. you didn't want
> >>>> that to be in the
> >>>>>>>> core (and it doesn't have to be, but that means that the
> >>>> output module
> >>>>>> may
> >>>>>>>> need to do the retries)
> >>>>>>>
> >>>>>>> It's not that I didn't like it to be in core, I
> >>>> questioned (at least I hope
> >>>>>>> so) if it is useful at all. I just re-checked and this
> >>>> indeed is what I did
> >>>>>>> ;)
> >>>>>>>
> >>>>>>> I try to sum up the reason for this. So far, all output
> >>>> modules return one
> >>>>>> of
> >>>>>>> three states:
> >>>>>>>
> >>>>>>> RS_RET_OK,
> >>>>>>> which means everything went well
> >>>>>>>
> >>>>>>> RS_RET_SUSPENDED,
> >>>>>>> which means we had a temporary failure and the core
> >>>> should retry the action
> >>>>>>> some time later
> >>>>>>>
> >>>>>>> RS_RET_DISABLED,
> >>>>>>> which means that we had a permanent failure *of the
> >>>> action* and there is no
> >>>>>>> point in retrying the action any longer (this actually
> >>>> means the action has
> >>>>>>> died for the rest of the lifetime of this rsyslogd instance)
> >>>>>>>
> >>>>>>> Note that none of the standard modules returns an "I have
> >>>> a permanent
> >>>>>> failure
> >>>>>>> with only this message".
> >>>>>>>
> >>>>>>> Of course, it may be an oversight, and I tend to agree
> >>>> that we need to add
> >>>>>>> such a state (or better, a variety of states that provide
> >>>> back several
> >>>>>> causes
> >>>>>>> for message-permanent failures). But let's first consider
> >>>> why this state is
> >>>>>>> not yet already present. The simple truth is that we so
> >>>> far did not need
> >>>>>> it,
> >>>>>>> because such a failure is very seldom (and I initially thought
> >>>>>> non-existing).
> >>>>>>>
> >>>>>>> When can it happen? With the file output? No, why should
> >>>> one text string
> >>>>>> not
> >>>>>>> be able to be written, but the next one is. Syslog
> >>>> forwarding? No, same
> >>>>>>> reason. Why one message but not the next one? Same story
> >>>> for user messages
> >>>>>>> and so on. I think this covers all but the database outputs.
> >>>>>>>
> >>>>>>> Now let's look at the database outputs. It first looks
> >>>> like the same story,
> >>>>>>> but if you look more closely, it may be different (and
> >>>> this subtle issue
> >>>>>> did
> >>>>>>> not go into the design of the original database outputs).
> >>>> I can see that
> >>>>>> the
> >>>>>>> situation happens if a unique index is defined on one
> >>>> field. Then, if for
> >>>>>>> example we need to store a sequence of three records, and
> >>>> they have the
> >>>>>>> following key sequence: (a,a,b). So the second record
> >>>> will fail, as the
> >>>>>> first
> >>>>>>> record already had the same key, but the third record
> >>>> will work (thanks to
> >>>>>> a
> >>>>>>> different key). The original design was made under the
> >>>> assumption that such
> >>>>>> a
> >>>>>>> unique index placed on a field who's values are not 
> unique is a
> >>>>>> configuration
> >>>>>>> error (and I still tend to believe it is).
> >>>>>>>
> >>>>>>> One may now argue, that while this may be the case,
> >>>> rsyslog should be able
> >>>>>> to
> >>>>>>> recover from such user error. I tend to agree to that
> >>>> argument - but I am
> >>>>>> not
> >>>>>>> sure if some other folks would argue that this is indeed
> >>>> an user error and
> >>>>>> if
> >>>>>>> we permit losing this record, we actually have a data
> >>>> loss bug. There is
> >>>>>> also
> >>>>>>> some truth in that argument. It may be better to see this
> >>>> case as a
> >>>>>> temporary
> >>>>>>> failure (after all, the real cure is to remove the unique
> >>>> index or fix an
> >>>>>>> invalid sql statement). But again, I tend to follow the
> >>>> argument that this
> >>>>>> is
> >>>>>>> a message-permanent processing failure and as such the
> >>>> message needs to be
> >>>>>>> discarded so that further messages can be handled.
> >>>>>>>
> >>>>>>> HOWEVER, this is the *only* situation in which I can 
> think of a
> >>>>>>> message-permanent processing failure. No other case comes
> >>>> up my mind, and I
> >>>>>>> think this was also the conclusion of our Sep 2008 discussion.
> >>>>>>>
> >>>>>>> But if there is only one case, and it is very remote,
> >>>> does that actually
> >>>>>>> justify adding a (somewhat) complex algorithm if a
> >>>> simpler would also do?
> >>>>>> I'd
> >>>>>>> say that we can be fine with O(n) in case one of these
> >>>> very remote failures
> >>>>>>> happens. That means we could simply resort to doing every
> >>>> insert in its own
> >>>>>>> transaction (just like without batching). Granted, if
> >>>> endTransaction()
> >>>>>>> dominates the cost function, we have O(n) vs. O(1) in
> >>>> that case, but if
> >>>>>> that
> >>>>>>> case happens infrequently enough (what I assume), there
> >>>> really is no
> >>>>>>> difference between the two. And as you say, even the
> >>>> "binary search"
> >>>>>> approach
> >>>>>>> is not O(log(n)) but rather O(n/2) [yes, I know, but it
> >>>> makes sense
> >>>>>> here...].
> >>>>>>> So the gain in using that algorithm is even less (in
> >>>> essence, we are at
> >>>>>> O(n),
> >>>>>>> whatever we use for recovery).
> >>>>>>>
> >>>>>>> I may be overlooking some cases for message-permanent
> >>>> failures and, if I
> >>>>>> do,
> >>>>>>> I would be very grateful if you (or someone else) could
> >>>> point me to them.
> >>>>>>>
> >>>>>>> If you look at your algorithm (I've put it up here for
> >>>> easy reference:
> >>>>>>>
> >>>> http://blog.gerhards.net/2009/04/batch-output-handling-algorit
> >>>> hm.html ) and
> >>>>>>> what I have crafted yesterday, you'll see that my writeup
> >>>> is strongly
> >>>>>>> influenced by your proposal. However, as I wrote in my
> >>>> initial response to
> >>>>>>> your message, the algorithm fails at some subtleties that
> >>>> you did not know
> >>>>>> at
> >>>>>>> the time of writing (most importantly the push-model of
> >>>> the queue object).
> >>>>>>>
> >>>>>>> There are also some fine details of the retry handling
> >>>> that needs to go
> >>>>>> into
> >>>>>>> the algorithm, as well as proper handling of the now 
> four cases:
> >>>>>>>
> >>>>>>> * success
> >>>>>>> * temporary failure
> >>>>>>> * message-permanent failure
> >>>>>>> * action-permanent failure
> >>>>>>>
> >>>>>>
> >>>>>> What I forgot the mention is that we currently have a kind
> >>>> of "pseudo"
> >>>>>> message-permanent failure state. That is related to
> >>>> temporary failure. My
> >>>>>> description above was under the assumption that the action
> >>>> is configured for
> >>>>>> eternal retry (retryCount = -1). If a upper bound for the
> >>>> number of retries n
> >>>>>> is set, then and only then a message enters
> >>>> message-permanent failure state
> >>>>>> after n unsuccessful retries. If that happens, the message
> >>>> is discarded, the
> >>>>>> in-sequence retry counter is reset and the next message
> >>>> scheduled for
> >>>>>> processing (sounds like it would be a good idea to draw a
> >>>> state diagram...).
> >>>>>> The idea behind this is that it often is preferable to
> >>>> lose some messages if
> >>>>>> they cannot be processed rather than re-trying. But I have
> >>>> to admit that the
> >>>>>> functionality is rooted in rsyslog's past, where no
> >>>> capable queues existed. I
> >>>>>> still think it needs to be preserved - many use it and I
> >>>> see lots of use
> >>>>>> cases.
> >>>>>>
> >>>>>> But: If we have such an upper bound n, we need to think
> >>>> how to handle this
> >>>>>> situation for batches. If we have a batch size b > 1, the
> >>>> old interface
> >>>>>> retried n times before discarding the first message and 2n
> >>>> times before
> >>>>>> discarding the second. For the b-th record, we have bn
> >>>> retries before it is
> >>>>>> discarded (maybe one off each, not checked exactly). There
> >>>> is no point in
> >>>>>> modeling this with batches. So we could use between n..bn
> >>>> retries. The
> >>>>>> problem here is that b is not fixed - it is between one
> >>>> and the configured
> >>>>>> upper bound of the batch size. So we do not get (fully)
> >>>> deterministic
> >>>>>> behavior. Question is whether or not this is actually
> >>>> important, as the
> >>>>>> unavailability of the action's target (when looked from
> >>>> the rsyslog POV) is
> >>>>>> also not deterministic. So I tend to use a fixed number of
> >>>> retries for the
> >>>>>> whole batch. That could simply be n, because the user can
> >>>> configure it. Or we
> >>>>>> could derive a new n' from the configured one, e.g. by n'
> >>>> = n * b/10 (where b
> >>>>>> is the *actual* batch size, as above, not the upper
> >>>> bound!). That way, we the
> >>>>>> max number of retries would be related to the actual batch
> >>>> size and varies
> >>>>>> with it. Somehow, just a feeling, I'd *not* go for n' = nb
> >>>> (though this may
> >>>>>> just be an emotional position...).
> >>>>>>
> >>>>>> Comments appreciated.
> >>>>>>
> >>>>>> Rainer
> >>>>>>
> >>>>>>>>
> >>>>>>>> the question is which side the retry logic needs to be in.
> >>>>>>>>
> >>>>>>>> it can be in the queue walkder that calls the output module
> >>>>>>>>
> >>>>>>>> it can be in doEndBatch() in the output module
> >>>>>>>>
> >>>>>>>> some output modules don't need partial retries (as they
> >>>> can output
> >>>>>> partial
> >>>>>>>> batches)
> >>>>>>>>
> >>>>>>>> some output modules do need partial retries.
> >>>>>>>>
> >>>>>>>> the more complicated retry logic will work for both
> >>>> situations, or it can
> >>>>>>>> be implemented in each of how ever many output 
> modules need it.
> >>>>>>>>
> >>>>>>>> it can go either way, I tend to lean towards only having
> >>>> the logic in one
> >>>>>>>> place (even if it's more complicated logic than some
> >>>> modules need)
> >>>>>>>
> >>>>>>> I fully agree here with you. That functionality belongs
> >>>> into the core, we
> >>>>>>> just need to craft it well. I think the probably most
> >>>> important question at
> >>>>>>> this time is if I overlook some possible
> >>>> message-permanent failure cases.
> >>>>>> So
> >>>>>>> I would greatly appreciate feedback especially on that
> >>>> part of my reply.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Rainer
> >>>>>>>
> >>>>>>>>
> >>>>>>>> David Lang
> >>>>>>>> _______________________________________________
> >>>>>>>> rsyslog mailing list
> >>>>>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >>>>>>>> http://www.rsyslog.com
> >>>>>>> _______________________________________________
> >>>>>>> rsyslog mailing list
> >>>>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >>>>>>> http://www.rsyslog.com
> >>>>>> _______________________________________________
> >>>>>> rsyslog mailing list
> >>>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >>>>>> http://www.rsyslog.com
> >>>>>
> >>>>> _______________________________________________
> >>>>> rsyslog mailing list
> >>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >>>>> http://www.rsyslog.com
> >>>>>
> >>>> _______________________________________________
> >>>> rsyslog mailing list
> >>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >>>> http://www.rsyslog.com
> >>>>
> >>> _______________________________________________
> >>> rsyslog mailing list
> >>> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >>> http://www.rsyslog.com
> >>>
> >> _______________________________________________
> >> rsyslog mailing list
> >> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >> http://www.rsyslog.com
> >>
> > _______________________________________________
> > rsyslog mailing list
> > http://lists.adiscon.net/mailman/listinfo/rsyslog
> > http://www.rsyslog.com
> >
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com
> 



More information about the rsyslog mailing list