[rsyslog] ultra-reliable speed test
Rainer Gerhards
rgerhards at hq.adiscon.com
Fri May 8 13:28:08 CEST 2009
On Fri, 2009-05-08 at 02:07 -0700, david at lang.hm wrote:
> > another uneducated question: does that ensure that all fs control
> > structures be written? I mean things like the chain that links file
> > parts together. My understanding is the answer is "yes", but I prefer to
> > ask as I am not 100% sure.
>
> yes, if you do a fsync on the file and on the directory the file is in you
> are absolutely safe. this is what the good mail servers do when receiving a
> message.
>
> if the file size does not change (say you pre-allocate the file, or are
> overwriting a file, like you could be doing for a queue) you don't have to
> do the fsync on the directory.
>
thanks and very good to know
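To make the rule above concrete, here is a minimal sketch of the "fsync the file, then fsync its directory" pattern. It assumes a POSIX system where a directory can be opened read-only and passed to fsync (this works on Linux); the function name durable_write is hypothetical and is not taken from rsyslog.

```python
import os
import tempfile


def durable_write(directory: str, name: str, data: bytes) -> str:
    """Create directory/name with the given data, then fsync file and directory.

    Hypothetical helper for illustration only; assumes POSIX semantics.
    """
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's userspace buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push file data to stable storage

    # A newly created file also changes the directory entry, so sync that too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(durable_write(tmp, "queue.dat", b"<13>test message\n"))
```

When the file is pre-allocated or overwritten in place, the second half (the directory fsync) is the part that could be skipped, as described above.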
> >
> > Let's say you find out the max rate R via e.g. TCP, and then use R as an
> > upper bound of the UDP traffic, that should work. But I would also find
> > it interesting to see how many messages are dropped if you send at a
> > rate >> R. I would not be surprised if the resulting commit rate would
> > be (even far) below R.
>
> it depends on where things get dropped. if I send enough UDP packets to
> flood the OS buffer, it will drop the packets and rsyslog will never know
> that they existed.
that's what I am thinking (and concerned) about
> below that, when rsyslog has a full queue and there is lock contention
> between the thread trying to insert messages into the queue and the thread
> pulling messages out of the queue it does slow down. I don't know if that
> will be visible on the disk-based queue, but it was _very_ visible on the
> memory based queue.
>
> >>
> >>>> set the main queue mode to disk
> >>>> enable fsyncs everywhere
> >>>
> >>>
> >>> Just as a reminder: this includes $MainMsgQueueCheckpointInterval 1 (which is
> >>> a *real* performance eater and puts a lot of burden on the consistency of the
> >>> file system's control structures, thus my question on ext2 vs. ext3 above).
> >>
> >> does this do an fsync on the directory?
> >
> > No! But I think it would be easy to add (but easy only in a
> > non-optimized way, optimization would take more effort).
>
> I'll test as-is, and if the numbers are high enough to be interesting,
> we'll hack that in and see how badly it hurts us (to drive things in a
> worst-case way)
ack
> > OK, a "few thousand" is not what I have on my mind for a
> > high-performance system (a "few ten-thousand"), but I agree that it can
> > be considered a busy system. So a "few thousand" (maybe more than
> > 5,000?) should be sufficient to prove the original point - especially as
> > hardware gets faster AND you can use solid state disks or similar
> > mechanisms (if assuming they qualify for the reliability criteria).
>
> I'm a bit amused by this criterion. IIRC, when I started playing with
> rsyslog before any of the performance improvements were done, wasn't this
> the best data rate that you could get out of rsyslog with a ram-based
> queue?
>
> I know that with two outputs (disk + relay) I was only getting ~30,000
> messages/sec. (with disk only output it could get up to ~80,000)
>
That's the price you have to pay for educating me ;) You convinced me
that this data rate is too slow for a really busy server, and so I am
now applying that knowledge ;)
> also note that these tests are being done on the version _without_ batch
> processing. I need to think about it a bit more to be sure there aren't
> any holes in my thinking, but I believe that you would only need to do one
> set of fsyncs per batch that's processed. so setting a batch size of 100
> should increase the messages/sec by a similar factor.
I hadn't thought about this, but now that you say it, I agree. Actually,
an fsync per queue lock release would probably be the right criterion. I
think that is almost equivalent to what you said, but the advantage of
that definition is that I can simply watch out for these *already
existing* places as a guideline. That can indeed make a considerable
difference.
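A rough sketch of why batching matters: the number of fsync calls drops from one per message to one per batch. This only illustrates the idea and is not how the rsyslog queue code is written; append_batched and its parameters are hypothetical names.

```python
import os
import tempfile


def append_batched(path: str, messages: list, batch_size: int) -> int:
    """Append messages (bytes) to path, calling fsync once per batch.

    Returns the number of fsync calls made. Hypothetical helper.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    fsync_calls = 0
    with open(path, "ab") as f:
        for start in range(0, len(messages), batch_size):
            batch = messages[start:start + batch_size]
            f.write(b"".join(batch))
            f.flush()
            os.fsync(f.fileno())  # one sync covers the whole batch
            fsync_calls += 1
    return fsync_calls


if __name__ == "__main__":
    msgs = [b"message %d\n" % i for i in range(200)]
    with tempfile.TemporaryDirectory() as tmp:
        per_message = append_batched(os.path.join(tmp, "a.log"), msgs, 1)
        per_batch = append_batched(os.path.join(tmp, "b.log"), msgs, 100)
        print(per_message, per_batch)
```

Whether the messages/sec really scale by the batch size depends on how much of the per-message cost is the fsync itself, which is exactly what the tests should show.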
> this is only on the output side for now, but if this proves to be
> interesting, some inputs could batch as well (from your comments it sounds
> as if relp can send a batch of messages and then get acknowledgement of
> all of them at once, if so, that could serve as the input)
That's a sliding window, but this is something that really does not
belong in the app layer (and is not visible there). It is the same
thing as the tcp sliding window, which you know to exist but do not know
any specifics of.
Even if we made the relp sliding window visible to the app layer,
it wouldn't provide much benefit. The only one I can think of is reduced
lock contention, but with the queue workers now acquiring the lock only
once per batch, the probability of contention is already greatly reduced.
> > One thing we need to think about is burst traffic rate, especially with
> > UDP. I tend to think that such a system must be able to support UDP
> > traffic, too (which is a questionable opinion) and, if so, we must not
> > only look at the sustained but even more at the burst rate.
>
> yes and no. while I see the need to support UDP, it's not going to be
> reliable (the OS buffers them before they get to the system, ignoring the
> network ability to drop them), and if you really need high UDP burst rates
> you could run two copies of rsyslog, one ultra-reliable (with reliable
> inputs), and a second one with a memory queue, feeding into the
> ultra-reliable one with a batched input method.
ack - as I said, the opinion is questionable... But what if you have
important devices that simply do not speak anything else but UDP (they
still seem to exist...).
However, think of it that way:
You limit the max burst rate by using an ultra-reliable queue. You do
so, because you do not want to lose messages when a sudden power failure
occurs. To support that configuration, you need to run the second
instance. It queues in memory until the (slower) reliable rsyslogd can
now accept the message and put it into the reliable queue. Let's say
that you have a burst of r messages and that from this burst only r/2
can be enqueued (because the ultra reliable queue is so slow). So you
lose r/2 messages.
Now consider the case that you run rsyslog with just a reliable queue,
one that is kept in memory but not able to cover the power failure
scenario. Obviously, all messages in that queue are lost when power
fails (or almost all to be precise). However, that system has a much
broader bandwidth. So with it, there would never have been r messages
inside the queue, because that system has a much higher sustained
message rate (and thus the burst causes much less trouble). Let's say
the system is just twice as fast in this setup (I guess it usually would
be *much* faster). Then it would be able to process all r records.
In that scenario, the ultra-reliable system loses r/2 messages, whereas
the somewhat more "unreliable" system loses none - by virtue of being
able to process messages as they arrive.
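The arithmetic of that scenario, as a tiny sketch. The value r = 1000 is an arbitrary assumption picked for illustration, and lost_in_burst is a hypothetical helper: whatever exceeds the number of messages the receiver can enqueue during the burst is counted as lost.

```python
def lost_in_burst(burst_size: int, enqueue_capacity: int) -> int:
    """Messages lost when a burst exceeds what can be enqueued in time."""
    return max(0, burst_size - enqueue_capacity)


r = 1000                           # assumed burst size, for illustration only
slow_capacity = r // 2             # ultra-reliable queue takes only r/2
fast_capacity = 2 * slow_capacity  # memory queue assumed twice as fast

print(lost_in_burst(r, slow_capacity))  # 500, i.e. r/2
print(lost_in_burst(r, fast_capacity))  # 0
```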
Now extend that picture to messages residing inside the OS buffers or
even those that are still queued in their sources because a stream
transport blocked sending them.
I know that each detail of this picture can be argued about at length.
However, my opinion is that there is no "ultra-reliable" system in life,
only various probabilities of losing messages. These probabilities
often depend on each other, which makes calculating them very hard to
impossible. Still, the reliability of the system at large is just the
product of the reliabilities of each of its components (treating them as
independent for simplicity). And the probability of message loss is just
the complement of that reliability.
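One way to make such a calculation concrete, under the simplifying assumption that the stages are independent and arranged in series: a message is delivered only if every stage passes it on, so the per-stage delivery probabilities multiply, and the overall loss probability is one minus that product. The numbers below are made up for illustration and are not measurements.

```python
from math import prod


def overall_loss_probability(stage_loss: list) -> float:
    """Loss probability of a chain of independent stages in series."""
    for p in stage_loss:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must be within [0, 1]")
    delivered = prod(1.0 - p for p in stage_loss)
    return 1.0 - delivered


# Hypothetical figures: [loss before the queue, loss inside the queue]
ultra_reliable = [0.5, 0.0]    # drops half of a burst, never loses queued data
fast_in_memory = [0.0, 0.001]  # keeps up with the burst, small power-failure risk

print(overall_loss_probability(ultra_reliable))
print(overall_loss_probability(fast_in_memory))
```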
This is where *I* conclude that it can make sense to permit a system to
lose some messages under certain circumstances, if that influences the
overall probability calculation towards the desired end result. In that
sense, I tend to think that a fast, memory-queuing rsyslogd instance can
be much more reliable compared to one that is configured as being
ultra-reliable, where the rest of the system at large is badly
influenced by this (the scenario above).
However, I also know that for regulatory requirements, you often seem to
need to prove that a system may not lose messages once it has received
them, even at the cost of an overall increased probability of message
loss.
My view of reliability is much the same as my view of security: there is
no such thing as "being totally secure", you can just reduce the
probability that something bad happens. The worst thing in security is
someone who thinks he is "totally secure" and as such is no longer
actively looking at potential issues.
I see the same for reliability. There is no such thing as "being totally
reliable" and it is a really bad idea to think you could ever be.
Knowing this, one may begin to think about how to decrease the overall
probability of message loss AND think about what rate is acceptable (and
what to do with these cases, e.g. "how can they hurt").
... but ... enough of philosophy, I am not sure if it helps this
discussion ;) (but I thought it is useful to "see" what I have on my
mind when talking about these things).
> > As a side-note, you will probably see that the disk queue can be
> > optimized. If sufficient effort is made, I think it can perform
> > faster by at least a factor of four to six. The reason is that it was
> > never really meant to be used on a busy box in this way. While knowing
> > this, we should not start a new discussion about these optimizations,
> > simply because they take considerable additional time and we can not fit
> > that part into anything we have on our mind for the foreseeable future.
>
> yeah, I've been thinking of various things that could be done here, but I
> won't ask about any of them for now ;-)
Oh yes, a broad range: from simple things like zipping the data and keeping
all handles always open to complex things like a dedicated,
random-access, database-like disk queue store (being even
preformatted). If you look at the code, you'll possibly notice that the
disk queue system uses stream drivers to persist the data. This would be
the hook to extend.
... but: that's a story for another quarter ;)
Thanks again for your carefully thought-out comments, they really help in
getting things right.
Rainer
>
> David Lang