[rsyslog] output plugin calling interface

Rainer Gerhards rgerhards at hq.adiscon.com
Sun May 3 11:13:36 CEST 2009


> -----Original Message-----
> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
> bounces at lists.adiscon.com] On Behalf Of david at lang.hm
> Sent: Sunday, May 03, 2009 2:42 AM
> To: rsyslog-users
> Subject: Re: [rsyslog] output plugin calling interface
> 
> On Sat, 2 May 2009, Rainer Gerhards wrote:
> 
> > After a lot of thinking today, we can have a "kind of" transactional
> > queue, but we need to accept potential message *duplication* in the event
> > of failures (but no loss).
> 
> this is the approach that you have taken for other things (relp for
> example), and when we were discussing reliability for direct mode vs disk
> queues you mentioned that rsyslog could duplicate messages in case of
> failures, but would not lose messages.

I think I always mentioned that the currently processed message is at risk.
We are drawing a very fine line here, I think, because in case of a fatal
failure, we always end up with some uncertainty. Just think about the fact
that all sources besides RELP are much more unreliable than the engine
itself. So, compared to the environment in which rsyslog is intended to work,
I think it is far more reliable than the whole rest of said system. So I am
somewhat hesitant to put *a lot* of effort into a small part of reliability
which, looking at the overall picture, cannot really be utilized.

Just let me reiterate that we are still talking about a total failure
situation. So if the whole data center loses power, it is quite random what
we receive. It is also quite random if the receiver manages to enqueue the
message, if it receives it. It is uncertain if the disk subsystem (at the
controller level) manages to complete the disk transaction.

Let's assume the data center's power subsystem emits a failure message while
it dies. Obviously, this one would be especially important to save. To do so,
we must be lucky enough so that

a) power system can emit a valid message
b) network components have long enough power to send message to rsyslogd
c) rsyslog input runs long enough to actually receive the message (think OS
reception queue)
d) rsyslog input can parse message
e) parsed message can be handed over to queue subsystem
f) queue subsystem can commit message to disk

All of this needs to happen to ensure the message is saved. I'd say there is
a lot of potential that we lose a message along that path. But now let's
consider this all somehow works. Then this happens:

g) queue subsystem dequeues message from disk (now it is in danger again)
h) message is run through filter engine
i) action is carried out (assuming a direct queue)

The probability of message loss now depends mostly on i). If that is a quick
action (like a write to disk), the probability is very low. If it is an action
that connects to the network (e.g. database, forwarding), I'd say the
probability tends to 1. So, yes, in this case the message will most probably
be lost.

However, the relative probability of loss depends on the probability that
a) to f) succeed, which I consider to be very low. And it also depends on a
number of other factors. E.g. the OS, if it is notified of the power fail,
will try to ensure system consistency as much as possible and thus will
probably not turn back into user space during emergency processing, reducing
the probability of step c) towards 0.

The question now, IMHO, is how important it is to address this very limited
potential for message loss. But I agree that with batches, the magnitude of
the problem increases, and so the additional relative probability of message
loss may increase enough to justify looking at the issue.

> 
> > This would work without a two-phase commit. However,
> > there still is considerable effort to implement it.
> 
> as I understand things the current process is
> 
> thread A receives the message and puts it in the Queue
> 
> worker thread B pulls the message from the queue, formats it and puts it in
> the action queue (if there is no action queue, this triggers the output
> module code as part of thread B.)
> 
> if there is an action queue, thread C is running, and does basically the
> same thing that thread B would do if there was no action queue
>

Right!
 
> 
> what I am envisioning is that the worker thread would touch the queue one
> additional time.
> 
> instead of removing the message from the queue to perform the action it
> would mark the message as being 'in process', then after the message is
> delivered it would delete it from the queue (touching the queue three times
> instead of two)
> 

Yes, full ack. I am thinking along these lines, too, and it is good to hear
that someone has arrived at this independently of me.
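
The three-touch idea quoted above could be sketched roughly like this. This
is a toy, in-memory illustration under my own assumptions, not how the
rsyslog queue object works: the class name ThreeTouchQueue and its methods
(enqueue, begin, commit, recover) are hypothetical, and a real disk queue
would have to persist (and sync) each of the three touches.

```python
from collections import OrderedDict


class ThreeTouchQueue:
    """Toy queue: a message stays queued until delivery is confirmed."""

    def __init__(self):
        self._entries = OrderedDict()  # msg_id -> [message, state]
        self._next_id = 0

    def enqueue(self, message):
        """Touch 1: store the message."""
        msg_id = self._next_id
        self._next_id += 1
        self._entries[msg_id] = [message, "queued"]
        return msg_id

    def begin(self):
        """Touch 2: mark the oldest queued message as 'in_process'."""
        for msg_id, entry in self._entries.items():
            if entry[1] == "queued":
                entry[1] = "in_process"
                return msg_id, entry[0]
        return None

    def commit(self, msg_id):
        """Touch 3: delete the message once the action has delivered it."""
        del self._entries[msg_id]

    def recover(self):
        """After a restart, make unconfirmed messages deliverable again."""
        for entry in self._entries.values():
            entry[1] = "queued"

    def __len__(self):
        return len(self._entries)


q = ThreeTouchQueue()
q.enqueue("msg A")
first = q.begin()    # delivery starts ...
q.recover()          # ... but we "crash" before commit and restart
second = q.begin()   # the same message is handed out again
q.commit(second[0])
```

If the crash happens after the action delivered the message but before
commit(), the message is delivered twice. That is the "duplication, but no
loss" trade-off discussed above.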

However, while this sounds very simple, there are a lot of subtle issues.
Just think about the different queue modes. Especially the transition from
memory-only to hybrid mode (and hybrid mode in general) for DA queues brings
a lot of potential trouble spots.

There is also a performance price to pay with the additional reliability we
get.

> 
> > I wonder if the use case
> > actually justifies it. Please also consider what I wrote below on the
> > performance of any ultra-reliable version. And, yes, I know we have fast
> > and reliable controllers today, but even then the disk path is much, much
> > slower than any memory based queue. I fail to believe you can build a very
> > high-performance syslog server on a disk queue, even with the best
> > hardware money can buy today.
> 
> I'm going to be testing this shortly ;-) I have a fusion IO drive to try
> and will be getting some boxes with the Intel X-25E SSD drives in a couple
> of weeks. the only thing I can't try is the ram-based drive.

That would be very good to know. So far, I have to admit, I fail to convince
myself that a disk-only configuration can be used for a high volume system.
If it cannot (let's assume that for a moment), we would need to run any such
system with at least a DA queue, that is, one that relies on messages being
held in memory at least for part of their lifetime. If that is true, all the
discussion about relative loss probabilities is irrelevant, because if we
have n messages exclusively in an in-memory queue, and we have a sudden
power loss, we surely lose all these n messages.

So I conclude that any motivation to try to prevent even the slightest loss -
in case of a total power loss (then and only then my opinion applies) -
depends on the ability to run a high-volume system in pure disk mode. If
that's possible, I agree preventing the loss is useful. If you cannot use
pure disk mode, you cannot totally prevent loss, and there is no point in
trying to minimize an effect in a situation that never occurs.

I have to give a training next Monday and Tuesday, so I may not be as
responsive. But I'll continue to think about the whole issue today.

Feedback on the "high-volume disk only" case would be most welcome. Actually,
I'd really love to be proven wrong (not only by the actual hardware results
[which vary over the years], but by an error in my argument), so please do.
If you can, we can probably build a much more reliable system than I
envisioned. So far, my reliability picture does not include disaster cases,
of which, of course, power failure is just the mildest.

Rainer

> 
> David Lang
> 
> > Rainer
> >
> >> -----Original Message-----
> >> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
> >> bounces at lists.adiscon.com] On Behalf Of Rainer Gerhards
> >> Sent: Saturday, May 02, 2009 10:33 AM
> >> To: rsyslog-users
> >> Subject: Re: [rsyslog] output plugin calling interface
> >>
> >>> -----Original Message-----
> >>> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
> >>> bounces at lists.adiscon.com] On Behalf Of david at lang.hm
> >>> Sent: Saturday, May 02, 2009 10:21 AM
> >>> To: rsyslog-users
> >>> Subject: Re: [rsyslog] output plugin calling interface
> >>>
> >>> On Sat, 2 May 2009, Rainer Gerhards wrote:
> >>>
> >>>>> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
> >>>>> bounces at lists.adiscon.com] On Behalf Of david at lang.hm
> >>>>>
> >>>>> On Fri, 1 May 2009, Rainer Gerhards wrote:
> >>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: rsyslog-bounces at lists.adiscon.com
> >>>>>>> [mailto:rsyslog-bounces at lists.adiscon.com] On Behalf Of
> >>>>>>> david at lang.hm
> >>>>>>>
> >>>>>>> On Fri, 1 May 2009, Rainer Gerhards wrote:
> >>>>>>>
> >>>>>>>> Please let me know if you also find a math model useful
> >>>>>>>> (but I'll probably need to do it in any case, because it helps me
> >>>>>>>> clean up my mind...).
> >>>>>>>
> >>>>>>> I think it will help clarify things a lot. with a good model
> >>>>>>> we won't have misunderstandings about what we are talking about.
> >>>>>>
> >>>>>> Yes - and I also think that with the model some complexities
> >>>>>> disappear. I think (hope I am right) the solution will become
> >>>>>> obvious. I know I am investing a lot of time in a tiny portion of
> >>>>>> the code, but this is one of the core elements involving many
> >>>>>> complexities.
> >>>>>>
> >>>>>>> with my 'binary search' approach, handling permanently bad
> >>>>>>> messages could be as simple as 'too many retries once we hit a
> >>>>>>> batch size of 1' (with a possible option of the output module
> >>>>>>> reporting back that it detected something that makes retries
> >>>>>>> useless, but this is just an optimization)
> >>>>>>
> >>>>>> Yes, indeed. One quick thought: I see a batch as a set of (msg,
> >>>>>> state) ordered pairs. Once we have processed it in one action (all
> >>>>>> of them have entered one permanent state), we can then build a
> >>>>>> subset that we use as the new (remaining) batch in the backup
> >>>>>> actions. So the "bad record search" is "just" one facet of many
> >>>>>> that we need to handle with little and hopefully simple code (doing
> >>>>>> it with 2000 LoC would be rather easy ;)).
> >>>>>
> >>>>> I agree with the definition of a batch. Let's see what different
> >>>>> states you are thinking of.
> >>>>>
> >>>>> I am currently assuming that the messages stay in the queue (with the
> >>>>> state attached) so that if rsyslog restarts (assuming disk queues),
> >>>>> it will realize that the message hasn't been delivered and try again.
> >>>>
> >>>> No, it is different: the batch is actually dequeued. So if at that
> >>>> point we have a system power failure (for whatever reason), the
> >>>> messages are lost. While the rsyslog engine intends to be very
> >>>> reliable, it is not a complete transactional system. A slight risk
> >>>> remains. For this, you need to understand what happens when the batch
> >>>> is processed. I assume that we have no sudden, untrappable process
> >>>> termination. Then, if a batch cannot be processed, it is returned back
> >>>> to the top of queue. This is not yet implemented, but is how single
> >>>> messages (which you can think of as an abstraction of a batch in the
> >>>> current code) are handled. If, for example, the engine shuts down, but
> >>>> an action takes longer than the configured shutdown timeout, the
> >>>> action is cancelled and the queue engine reclaims the unprocessed
> >>>> messages. They go into a special area inside the .qi file and are
> >>>> placed on top of the queue once the engine restarts.
> >>>>
> >>>> The only case where this does not work is sudden process termination.
> >>>> I see two cases:
> >>>>
> >>>> a) a fatal software bug
> >>>> We cannot really address this. Even if the messages were remaining in
> >>>> the queue until finally processed, a software bug (maybe an invalid
> >>>> pointer) may affect the queue structures at large, possibly even at
> >>>> the risk of total loss of all data inside that queue. So this is an
> >>>> inevitable risk.
> >>>>
> >>>> b) sudden power fail
> >>>> ... which can and should be mitigated at another level
> >>>>
> >>>> One may argue that there also is
> >>>>
> >>>> c) admin error
> >>>> e.g., kill -9 rsyslogd
> >>>> Here a fully transactional queue will probably help.
> >>>>
> >>>> However, I do not think that the risk involved justifies a far more
> >>>> complex fully transactional implementation of the queue object. Some
> >>>> risk always remains (what in the disaster case, even with a fully
> >>>> transactional queue?).
> >>>>
> >>>> And it is so complex to let the messages stay in queue because it is
> >>>> complex to work with such messages and disk queues. It would also cost
> >>>> a lot of performance, especially when done reliably (need to sync). We
> >>>> would then need to touch each element at least four times, twice as
> >>>> much as currently. Also, the hybrid disk/memory queues become very,
> >>>> very complex. There are more complexities around this, I just wanted
> >>>> to tell the most obvious.
> >>>>
> >>>> So, all in all, the idea is that messages are dequeued, processed and
> >>>> put back to the queue (think: ungetc()) when something goes wrong.
> >>>> Reasonable (but not more) effort is made to prevent message loss while
> >>>> the messages are in unprocessed state outside of the queue.
> >>>>
> >>>> Hope that clarifies and I am glad you brought this up. Made me think
> >>>> again, but I concluded to what I've written above ;)
> >>>
> >>> this is definitely different from the way I thought things worked from
> >>> our prior discussions about reliability. from those I understood that
> >>> rsyslog could be used to make a fully reliable system, if you are
> >>> willing to take the performance hit to do so.
> >>
> >> You can, but then you need to use batch sizes of 1.
> >>
> >>> as batch size increases (to gain efficiency) the number of log messages
> >>> that can be lost also increases.
> >>>
> >>> unfortunately I have the belief that power outages cannot be avoided
> >>> (I've seen cases where millions have been spent on the power systems
> >>> and still ended up with a datacenter-wide blackout).
> >>
> >> Let me think about this, but I think to protect against this problem,
> >> you really need to have two-phase commit, which I am not sure belongs
> >> into a syslogd.
> >>
> >>> when you get the model of things together we will be in a much better
> >>> position to discuss this.
> >>
> >> Well, we'd probably restart discussing reliability requirements. If it
> >> turns out that you need 100% reliability, no matter what happens at all,
> >> I am not sure if we can implement this without adding considerable
> >> database-ish processing. "Under all circumstances" reliability is very
> >> hard to achieve, especially if you also would like to have high
> >> performance. Think about it: to guard against the data center full power
> >> loss scenario, you need to have a disk-only queue, being synced to disk
> >> for every single en- and dequeue operation. This is extremely costly.
> >> Does it then really matter if we have large batches or not? The system,
> >> I think, will be so slow that you cannot use it for any demanding
> >> real-life application, so some compromise between speed and reliability,
> >> I think, must be made in any case.
> >>
> >>> it's 1:20am here and I'm ready to collapse.
> >>
> >> I hadn't even expected this response at this time ;)
> >>
> >> Rainer
> >>>
> >>> David Lang
> >>> _______________________________________________
> >>> rsyslog mailing list
> >>> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >>> http://www.rsyslog.com


