[rsyslog] output plugin calling interface
david at lang.hm
Sat May 2 10:20:56 CEST 2009
On Sat, 2 May 2009, Rainer Gerhards wrote:
>> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
>> bounces at lists.adiscon.com] On Behalf Of david at lang.hm
>>
>> On Fri, 1 May 2009, Rainer Gerhards wrote:
>>
>>>> -----Original Message-----
>>>> From: rsyslog-bounces at lists.adiscon.com
>>>> [mailto:rsyslog-bounces at lists.adiscon.com] On Behalf Of david at lang.hm
>>>>
>>>> On Fri, 1 May 2009, Rainer Gerhards wrote:
>>>>
>>>>> Please let me know if you also find a math model useful (but I'll probably
>>>>> need to do it in any case, because it helps me clean up my mind...).
>>>>
>>>> I think it will help clarify things a lot. With a good model we won't have
>>>> misunderstandings about what we are talking about.
>>>
>>> Yes - and I also think that with the model some complexities disappear. I
>>> think (hope I am right) the solution will become obvious. I know I am
>>> investing a lot of time in a tiny portion of the code, but this is one of
>>> the core elements involving many complexities.
>>>
>>>> with my 'binary search' approach, handling permanently bad messages could
>>>> be as simple as 'too many retries once we hit a batch size of 1' (with a
>>>> possible option of the output module reporting back that it detected
>>>> something that makes retries useless, but this is just an optimization)
>>>
>>> Yes, indeed. One quick thought: I see a batch as a set of (msg, state)
>>> ordered pairs. Once we have processed it in one action (all of them have
>>> entered one permanent state), we can then build a subset that we use as
>>> the new (remaining) batch in the backup actions. So the "bad record search"
>>> is "just" one facet of many that we need to handle with little and hopefully
>>> simple code (doing it with 2000 LoC would be rather easy ;)).
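
The two ideas quoted above (a batch as (msg, state) pairs, and a 'binary search' that splits a failing batch until the bad message is isolated at batch size 1) could be sketched roughly as follows. This is only an illustration in Python, not rsyslog code: the names State, Entry, process_batch, send and max_retries are all hypothetical, and it assumes that send() is all-or-nothing for the messages it is given.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class State(Enum):
    PENDING = "pending"  # not yet handled by this action
    DONE = "done"        # delivered (permanent state)
    BAD = "bad"          # given up on (permanent state)


@dataclass
class Entry:
    """One (msg, state) pair of a batch (hypothetical structure)."""
    msg: str
    state: State = State.PENDING


def process_batch(batch: List[Entry],
                  send: Callable[[List[str]], bool],
                  max_retries: int = 3) -> List[Entry]:
    """Run one action over the batch and return the remaining subset.

    send(msgs) is assumed to return True only if the whole list was
    accepted. A failing sub-batch is split in half until the failure
    is pinned to a single message, which is then retried max_retries
    times before being marked BAD.
    """
    def run(lo: int, hi: int) -> None:
        if lo >= hi:
            return
        msgs = [e.msg for e in batch[lo:hi]]
        if hi - lo == 1:
            # Batch size 1: "too many retries" means permanently bad.
            for _ in range(max_retries):
                if send(msgs):
                    batch[lo].state = State.DONE
                    return
            batch[lo].state = State.BAD
            return
        if send(msgs):
            for e in batch[lo:hi]:
                e.state = State.DONE
            return
        mid = (lo + hi) // 2
        run(lo, mid)
        run(mid, hi)

    run(0, len(batch))
    # Everything now has a permanent state; what was not delivered
    # becomes the new (remaining) batch for a backup action.
    return [e for e in batch if e.state is not State.DONE]
```

The returned list is what a backup action would receive as its batch. An output module that can report "retries are useless" would simply let the size-1 case skip the retry loop.
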
>>
>> I agree with the definition of a batch. Let's see what different states
>> you are thinking of.
>>
>> I am currently assuming that the messages stay in the queue (with the
>> state attached) so that if rsyslog restarts (assuming disk queues), it
>> will realize that the message hasn't been delivered and try again.
>
> No, it is different: the batch is actually dequeued. So if at that point we
> have a system power failure (for whatever reason), the messages are lost.
> While the rsyslog engine intends to be very reliable, it is not a complete
> transactional system. A slight risk remains. For this, you need to understand
> what happens when the batch is processed. I assume that we have no sudden,
> untrappable process termination. Then, if a batch cannot be processed, it is
> returned back to the top of queue. This is not yet implemented, but is how
> single messages (which you can think of as an abstraction of a batch in the
> current code) are handled. If, for example, the engine shuts down, but an
> action takes longer than the configured shutdown timeout, the action is
> cancelled and the queue engine reclaims the unprocessed messages. They go
> into a special area inside the .qi file and are placed on top of the queue
> once the engine restarts.
>
> The only case where this does not work is sudden process termination. I see two
> cases:
>
> a) a fatal software bug
> We cannot really address this. Even if the messages were remaining in the
> queue until finally processed, a software bug (maybe an invalid pointer) may
> affect the queue structures at large, possibly even at the risk of total loss
> of all data inside that queue. So this is an inevitable risk.
>
> b) sudden power fail
> ... which can and should be mitigated at another level
>
> One may argue that there also is
>
> c) admin error
> e.g., kill -9 rsyslogd
> Here a fully transactional queue will probably help.
>
> However, I do not think that the risk involved justifies a far more complex
> fully transactional implementation of the queue object. Some risk always
> remains (what about the disaster case, even with a fully transactional queue?).
>
> And letting the messages stay in the queue is complex, because it is hard
> to work with such messages and disk queues. It would also cost a lot of
> performance, especially when done reliably (need to sync). We would then need
> to touch each element at least four times, twice as much as currently. Also,
> the hybrid disk/memory queues become very, very complex. There are more
> complexities around this, I just wanted to mention the most obvious.
>
> So, all in all, the idea is that messages are dequeued, processed and put
> back to the queue (think: ungetc()) when something goes wrong. Reasonable
> (but not more) effort is made to prevent message loss while the messages are
> in unprocessed state outside of the queue.
>
> Hope that clarifies and I am glad you brought this up. It made me think again,
> but I came to the conclusion I've written above ;)
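
The dequeue / process / put-back model described in the quoted text (the "think: ungetc()" part) might look roughly like this. It is a minimal in-memory sketch in Python, not the rsyslog queue object: UngetQueue, dequeue_batch, unget and run_once are hypothetical names, and a failing action is assumed to signal failure by raising an exception.

```python
from collections import deque
from typing import Callable, Iterable, List


class UngetQueue:
    """Hypothetical FIFO queue whose head can take back a failed batch."""

    def __init__(self, items: Iterable[str] = ()) -> None:
        self._q = deque(items)

    def __len__(self) -> int:
        return len(self._q)

    def dequeue_batch(self, n: int) -> List[str]:
        """Remove and return up to n messages from the head."""
        batch = []
        while self._q and len(batch) < n:
            batch.append(self._q.popleft())
        return batch

    def unget(self, msgs: List[str]) -> None:
        """Put msgs back on top of the queue in their original order."""
        self._q.extendleft(reversed(msgs))


def run_once(queue: UngetQueue,
             action: Callable[[List[str]], None],
             batch_size: int) -> bool:
    """Dequeue one batch, run the action, and requeue on failure."""
    batch = queue.dequeue_batch(batch_size)
    # From here until the action returns, the messages exist only in
    # this local variable. A trappable failure is handled below; a
    # sudden process termination (kill -9, power loss) would lose them.
    try:
        action(batch)
    except Exception:
        queue.unget(batch)
        return False
    return True
```

The comment inside run_once() marks the window in question: the larger batch_size is, the more messages sit outside the queue at that moment.
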

This is definitely different from the way I thought things worked from our
prior discussions about reliability. From those I understood that rsyslog
could be used to make a fully reliable system, if you are willing to take
the performance hit to do so.

As batch size increases (to gain efficiency) the number of log messages
that can be lost also increases.

Unfortunately I have the belief that power outages cannot be avoided (I've
seen cases where millions have been spent on the power systems and still
ended up with a datacenter-wide blackout).

When you get the model of things together we will be in a much better
position to discuss this. It's 1:20am here and I'm ready to collapse.

David Lang