[rsyslog] abort in 4.2.1
david at lang.hm
david at lang.hm
Tue Aug 25 16:19:48 CEST 2009
On Tue, 25 Aug 2009, Rainer Gerhards wrote:
> On Mon, 2009-08-24 at 14:06 -0700, david at lang.hm wrote:
>>> I'm testing to see if it has the problem I reported with 4.2.1 where it dies
>>> under load from malformed messages.
>>
>> It finally died just like 4.2.1 did. It took a _lot_ longer (which may
>> just be that the race condition to cause the crash is smaller, 5.x is
>> _significantly_ more efficiant than 4.x is. processing ~1800 messages/sec,
>> writing them locally and relaying them to another machine eats up <2% cpu
>> according to top)
>>
>> I restarted it in debug mode (this takes more cpu, almost 10% of a cpu)
>
> The bad thing about debug mode is that not only it is slower, but it
> introduces some synchronization. So race bugs frequently disappear when
> debug mode is turned on. Anyhow, sometimes they persist and then the
> debug log often provides good information (aka "definitely worth a
> try" ;)).
>
> I did some basic testing with the malformed message you provided in an
> earlier message, but I unfortunately did not see anything that is not
> clean. I am still a bit of the assumption that the malformednes of the
> message is not a necessary condition for the segfault - but that needs
> to be seen. No abort happened (yet) in my lab.
I did finally get it to die, as soon as I get into the office I'll look at
the end of the debug log
the box I am duplicating this problem on relays all the logs it recieves
up to another central box. the logs that come through this box are about a
tenth of the total logs that the central box gets, and that central box
has had no problems.
the things that I see as being different are
1. the central box doesn't see the malformed messages (one of the relay
boxes would fix that before forwarding it)
2. there are fewer systems sending simultaniously to the central box
(there are ~100 boxes sending to the relay that dies, but only a half
dozen relay boxes sending to the central box)
two of the other relays handle a _far_ higher rate of logs, but from fewer
sources (one has one source that spews ~15G of logs/day, the other
recieves ~100m logs/day from 6 machines). a third relay has more machines
sending it logs, but at a lower rate than those two (but still
significantly higher than the one that fails). if there was a problem with
load or the number of messages being recieved simultaniously I would
expect one of these other three to have more problems than the one that
fails on me.
3. a noticable fraction of the logs sent through this relay box are sent
by a cron job running on each of ~60 machines that wakes up every min and
scrapes a local file, sending all the pending messages, so the incoming
messages are a bit burstier than normal, the relaying is still bursty, but
it is only one bursty box, not many
note that even if this cron job is stopped I still had 4.2.1 die on this
relay box, so I don't think that it's the bursty nature of the traffic
this is why I'm suspicious of the malformed message handling
David Lang
More information about the rsyslog
mailing list