[rsyslog] ultra-reliable speed test
david at lang.hm
Fri May 8 11:07:52 CEST 2009
On Fri, 8 May 2009, Rainer Gerhards wrote:
> On Fri, 2009-05-08 at 01:18 -0700, david at lang.hm wrote:
>> On Fri, 8 May 2009, Rainer Gerhards wrote:
>>
>>>> -----Original Message-----
>>>> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
>>>> bounces at lists.adiscon.com] On Behalf Of david at lang.hm
>>>>
>>>> I have a box put together for a first cut at a speed test of rsyslog in
>>>> ultra-reliable mode. the outline below is intended to minimize the number
>>>> of variables.
>>>>
>>>> the box is a dual quad-core opteron with 8G of ram, one SATA drive and a
>>>> fusionIO SSD PCIE drive, currently running RHEL 5.3 kernel 2.6.18-53
>>>> (redhat stock kernel). I intend to format the SSD with ext2 (as the
>>>> application is providing data integrity, and to avoid the known
>>>> performance problems with ext3 and fsync)
>>>
>>> Just a question, because I do not know enough about ext2: does ext2 guarantee
>>> that when an application does fsync, all data, INCLUDING related file system
>>> control structures are written to disk? Or, to phrase it the other way
>>> around, can ext2 guarantee that fsync'ed data can always be read after a
>>> power failure? I am thinking of some control structures not being
>>> written, so the fsynced app data may be present on the disk, but cannot be
>>> accessed any longer. In the worst case, would it be possible that a whole
>>> file be lost during a file system check after reboot?
>>>
>>> My *uneducated* understanding is that ext3 does guard against this (thus the
>>> performance problems) but ext2 does not.
>>
>> the performance problem with ext3 is that it forces ALL pending writes to
>> disk when anything does a fsync
>>
>> now that you mention it, I think that with all filesystems other than ext2
> I think you meant ext3 here?
>
>> you need to do a fsync on the directory as well as on the file
>>
>
> another uneducated question: does that ensure that all fs control
> structures be written? I mean things like the chain that links file
> parts together. My understanding is the answer is "yes", but I prefer to
> ask as I am not 100% sure.
yes, if you do a fsync on the file and on the directory the file is in you
are absolutely safe. this is what the good mail servers do when receiving a
message.

if the file size does not change (say you pre-allocate the file, or are
overwriting a file, like you could be doing for a queue) you don't have to
do the fsync on the directory.
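As a rough illustration of the file-plus-directory fsync sequence described above, here is a minimal Python sketch. The function name durable_write and the temp-directory usage are hypothetical and not anything rsyslog itself does; it assumes a POSIX system (e.g. Linux) where a directory can be opened read-only and passed to fsync.

```python
import os
import tempfile

def durable_write(path, data):
    """Write data to a new file, then fsync the file and its directory."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        offset = 0
        while offset < len(data):
            # os.write may write fewer bytes than requested
            offset += os.write(fd, data[offset:])
        os.fsync(fd)          # flush the file's data and metadata
    finally:
        os.close(fd)

    # The new directory entry lives in the parent directory, so sync that too.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

if __name__ == "__main__":
    spool = tempfile.mkdtemp()
    durable_write(os.path.join(spool, "msg.0001"), b"test message\n")
```

For the pre-allocated or overwrite case, the second half (the directory fsync) is the part that could be skipped, since no directory entry changes.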
>>> If my understanding would be correct (and I don't say so), we would need to
>>> use ext3.
>>
>> I'll try both (and later on, when I use my own kernel rather than the
>> redhat one I'll also test XFS)
>>
>> I think that if no other disk activity is taking place ext3 may not be too
>> bad (one other advantage that ext2 would have over ext3 and XFS is that
>> journaling filesystems have to write whatever they journal twice, once to
>> the journal and once to the final location)
> ack
>
>>
>>>> for the rsyslog test I am thinking the following
>>>>
>>>> using rsyslog 4.1.7
>>>> enable input file
>>>
>>> Not sure if I got this bullet point right. Do you mean you intend to use
>>> imfile for input generation?
>>
>> yes, that was my intent. just to simplify things by making the test
>> completely self contained to the one box.
>
> there is a kind of interaction between imfile and the queue in that
> imfile flags its messages as "delayable", which was introduced to
> prevent imfile unnecessarily putting data too fast into the queue. But
> on the other hand, this should tune the system to the actual max rate
> (at least in theory).
>
>>
>>> In any case, I would suggest to do a test with UDP and one with TCP senders,
>>> both sending at maximum rate. With UDP, we would see a message loss rate,
>>> while with TCP we would see the actual number of messages that the system can
>>> process. So TCP is probably the more meaningful number, but packet loss rate
>>> for UDP - a common use case - would also be interesting, at least I think so.
>>
>> will do.
>>
>> I will be interested in seeing the UDP loss rate, I suspect that with
>> appropriate OS tuning I can get it down to zero loss rate at the data
>> rates that the rest of the system maintains (the OS has a buffer prior to
>> rsyslog's input process that can cover delays on the input threads)
>
> Let's say you find out the max rate R via e.g. TCP, and then use R as an
> upper bound of the UDP traffic, that should work. But I would also find
> it interesting to see how many messages are dropped if you send at a
> rate >> R. I would not be surprised if the resulting commit rate would
> be (even far) below R.
it depends on where things get dropped. if I send enough UDP packets to
flood the OS buffer, it will drop the packets and rsyslog will never know
that they existed.

below that, when rsyslog has a full queue and there is lock contention
between the thread trying to insert messages into the queue and the thread
pulling messages out of the queue it does slow down. I don't know if that
will be visible on the disk-based queue, but it was _very_ visible on the
memory based queue.
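The OS buffer in question is the socket receive buffer of the UDP listener. A minimal sketch, assuming a Linux-style sockets API, of requesting a larger buffer and reading back what the kernel actually granted. The function name and the 8 MB figure are arbitrary illustrations, not rsyslog settings; the kernel may cap the request (on Linux the cap is the net.core.rmem_max sysctl).

```python
import socket

def open_udp_receiver(host="127.0.0.1", port=0, rcvbuf_bytes=8 * 1024 * 1024):
    """Open a UDP socket and ask the kernel for a larger receive buffer."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # This is only a request; the kernel may silently grant less.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    sock.bind((host, port))
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted

if __name__ == "__main__":
    sock, granted = open_udp_receiver()
    print("requested", 8 * 1024 * 1024, "bytes, kernel reports", granted)
    sock.close()
```

Packets that arrive while this buffer is full are dropped by the kernel, which is why the receiving process never sees them.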
>>
>>>> set the main queue mode to disk
>>>> enable fsyncs everywhere
>>>
>>>
>>> Just as a reminder: this includes $MainMsgQueueCheckpointInterval 1 (which is
>>> a *real* performance eater and puts a lot of burden on the consistency of the
>>> file system's control structures, thus my question on ext2 vs. ext3 above).
>>
>> does this do a fsync on the directory?
>
> No! But I think it would be easy to add (but easy only in a
> non-optimized way, optimization would take more effort).
I'll test as-is, and if the numbers are high enough to be interesting,
we'll hack that in and see how badly it hurts us (to drive things in a
worst-case way)
>>
>>>> set the output to log *.* to a file
>>>>
>>>> run a cron job that rolls the log file once a min and sends a HUP to
>>>> rsyslog
>>>>
>>>> create a large file of log information
>>>>
>>>> run this for a while and then count the number of logs in each rolled log
>>>> file. hopefully the number will be reasonably consistent.
>>>>
>>>> does this sound like a reasonable approach? or is this going to not be
>>>> representative for some reason?
>>>
>>> With the few comments above, I think this is a very reasonable approach and
>>> should provide very good insight.
>>>
>>> Actually, I hope that it can prove my point that this setup is too slow
>>> wrong...
>>
>> there will definitely be a performance issue at some point here, the
>> question is if it's fast enough to be usable.
>>
>> the drive claims to be able to do >100,000 I/O ops/sec. if we can manage
>> to get a few thousand logs/sec written on this, it will be extremely
>> usable.
>
> OK, a "few thousand" is not what I have on my mind for a
> high-performance system (a "few ten-thousand"), but I agree that it can
> be considered a busy system. So a "few thousand" (maybe more than
> 5,000?) should be sufficient to prove the original point - especially as
> hardware gets faster AND you can use solid state disks or similar
> mechanisms (assuming they qualify for the reliability criteria).
I'm a bit amused by this criterion. IIRC, when I started playing with
rsyslog before any of the performance improvements were done, wasn't this
the best data rate that you could get out of rsyslog with a ram-based
queue?

I know that with two outputs (disk + relay) I was only getting ~30,000
messages/sec. (with disk only output it could get up to ~80,000)

also note that these tests are being done on the version _without_ batch
processing. I need to think about it a bit more to be sure there aren't
any holes in my thinking, but I believe that you would only need to do one
set of fsyncs per batch that's processed. so setting a batch size of 100
should increase the messages/sec by a similar factor.

this is only on the output side for now, but if this proves to be
interesting, some inputs could batch as well (from your comments it sounds
as if relp can send a batch of messages and then get acknowledgement of
all of them at once, if so, that could serve as the input)
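To make the batching arithmetic concrete, here is a small Python sketch that appends messages to a file and issues one fsync per batch rather than one per message. It is an illustration of the idea only, not how rsyslog's queue or output code works; write_batched and the file name are hypothetical.

```python
import os
import tempfile

def write_batched(path, messages, batch_size):
    """Append messages to path, fsyncing once per batch. Returns fsync count."""
    fsyncs = 0
    pending = 0
    with open(path, "ab") as f:
        for msg in messages:
            f.write(msg)
            pending += 1
            if pending == batch_size:
                f.flush()                 # push Python's buffer to the OS
                os.fsync(f.fileno())      # then force the OS to hit the disk
                fsyncs += 1
                pending = 0
        if pending:                       # sync a trailing partial batch
            f.flush()
            os.fsync(f.fileno())
            fsyncs += 1
    return fsyncs

if __name__ == "__main__":
    msgs = [b"msg %d\n" % i for i in range(1000)]
    out = os.path.join(tempfile.mkdtemp(), "out.log")
    print(write_batched(out, msgs, 1))    # 1000 fsyncs
    print(write_batched(out, msgs, 100))  # 10 fsyncs
```

The fsync count drops by the batch size; whether messages/sec rises by the same factor depends on how much of the per-message cost is the fsync itself, which is what the test would have to show.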
> One thing we need to think about is burst traffic rate, especially with
> UDP. I tend to think that such a system must be able to support UDP
> traffic, too (which is a questionable opinion) and, if so, we must not
> only look at the sustained but even more at the burst rate.
yes and no. while I see the need to support UDP, it's not going to be
reliable (the OS buffers the packets before they get to the system, ignoring
the network's ability to drop them), and if you really need high UDP burst
rates you could run two copies of rsyslog, one ultra-reliable (with reliable
inputs), and a second one with a memory queue, feeding into the
ultra-reliable one with a batched input method.

but it will be good to see where the limits are.
> As a side-note, you will probably see that the disk queue can be
> optimized. If sufficient effort is made, I think it can perform faster
> by at least a factor of four to six. The reason is that it was
> never really meant to be used on a busy box in this way. While knowing
> this, we should not start a new discussion about these optimizations,
> simply because they take considerable additional time and we can not fit
> that part into anything we have on our mind for the foreseeable future.
yeah, I've been thinking of various things that could be done here, but I
won't ask about any of them for now ;-)
David Lang