[rsyslog] Queuing bug (4.5.5)
Aaron Wiebe
epiphani at gmail.com
Tue Nov 3 18:27:07 CET 2009
Ok - I captured an strace. This appears to be the action thread.
This specific set of calls took minutes. I'll re-run this with -t
and/or -T.
Note that this syslog instance is not actually receiving any data right now.
-Aaron
26365 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
26365 futex(0x1bec924, FUTEX_WAIT, 97, NULL <unfinished ...>
26365 <... futex resumed> ) = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268800, 165944000}) = 0
26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268800, 166035000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 99, {1, 999909000} <unfinished ...>
26365 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 229) = 229
26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268802, 173123000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 101, NULL <unfinished ...>
26365 <... futex resumed> ) = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268893, 439872000}) = 0
26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268893, 439999000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 103, {1, 999873000} <unfinished ...>
26365 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 248) = 248
26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268895, 443096000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 105, NULL <unfinished ...>
26365 <... futex resumed> ) = 0
26365 clock_gettime(CLOCK_REALTIME, <unfinished ...>
26365 <... clock_gettime resumed> {1257268950, 612585000}) = 0
26365 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
26365 <... futex resumed> ) = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268950, 612757000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 107, {1, 999828000} <unfinished ...>
26365 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 180) = 180
26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268952, 619980000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 109, NULL <unfinished ...>
26365 <... futex resumed> ) = 0
26365 futex(0x1bec8c8, FUTEX_WAIT, 2, NULL <unfinished ...>
26365 <... futex resumed> ) = -1 EAGAIN (Resource
temporarily unavailable)
26365 clock_gettime(CLOCK_REALTIME, <unfinished ...>
26365 <... clock_gettime resumed> {1257268956, 628783000}) = 0
26365 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
26365 <... futex resumed> ) = 0
26365 clock_gettime(CLOCK_REALTIME, <unfinished ...>
26365 <... clock_gettime resumed> {1257268956, 628953000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 111, {1, 999830000} <unfinished ...>
26365 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 335) = 335
26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268958, 636092000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 113, NULL <unfinished ...>
On Tue, Nov 3, 2009 at 10:25 AM, Rainer Gerhards
<rgerhards at hq.adiscon.com> wrote:
>> -----Original Message-----
>> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
>> bounces at lists.adiscon.com] On Behalf Of david at lang.hm
>> Sent: Tuesday, November 03, 2009 4:06 PM
>> To: rsyslog-users
>> Subject: Re: [rsyslog] Queuing bug (4.5.5)
>>
>> On Tue, 3 Nov 2009, Aaron Wiebe wrote:
>>
>> > This is still taking place this hour - though I obviously can't
>> > restart onto a newer version without losing this case. Is there
>> > anything I can do in the current configuration to try and debug this
>> > situation?
>>
>> if you can strace each thread for a few seconds it may give you an idea
>> what's happening.
>
> This is a good suggestion.
>
> It would also potentially be enlightening to attach gdb to the process at
> various points in time and get a backtrace from all running threads.
>
> Other than that, there is little you can do on a running version. If it were
> compiled for debug, and the environment setup so that we could capture
> stdout/stderr, sending SIGUSR2 would yield to much the same information as
> the gdb backtrace BUT when runtime instrumentation is enabled, the build is
> operating at a third to a quarter of its normal speed.
>
>> in the meantime you need to stop the system from getting further
>> behind,
>> can you redirect or reconfigure the senders so that they are not
>> sending
>> new logs to this box so that it can dig itself out (stopping the input
>> may
>> be enough by itself, rsyslog has historicly done a LOT of locking
>> around
>> the main queue, and if you have a full queue with more data trying to
>> be
>> delivered it can really slow things down. I wouldn't expect it to be
>> this
>> much, but it could be part of it)
>>
>
> This may clean up things, but I really doubt it, given the magnitude of
> delays. Also, Aaron runs 4.4.5, which already has lots of the locking
> removed/restructure (not to compare with the recent v5-devel, but much more
> effcient than early v4 and v3).
>
> Rainer
>> David Lang
>>
>> > (We're up to 18:46:51 now!)
>> >
>> > -Aaron
>> >
>> > On Tue, Nov 3, 2009 at 1:46 AM, Rainer Gerhards
>> > <rgerhards at hq.adiscon.com> wrote:
>> >> mhhh... This is obviously not intended behavior. There are some
>> rate-limiting
>> >> features that can deliberately slow down a queue, but I guess you
>> have not
>> >> configured these. So there either is a bug that activates them at
>> some point
>> >> during processing (e.g. an invalid memory access could do that) or
>> there is
>> >> some other bug that causes the dequeue to happen very slowly. In any
>> case, it
>> >> is hard to guess.
>> >>
>> >> Given the volume of the messages, a debug log probably does not
>> help. We
>> >> could gain a little insight in at least the queue sizes that rsyslog
>> sees via
>> >> imdiag plus the (very rudamentary) v5 debug front-end (it doesn't
>> require the
>> >> engine to be v5!). I would need to spend at least a little work on
>> the
>> >> front-end to enable remote access, but that's not really a problem.
>> >>
>> >> One other thing is that I am still holding a v4-beta with additional
>> fixes as
>> >> I didn't want to put these in the middle of some other debugging.
>> However,
>> >> these fixes address potential memory problems, so the most
>> appropriate course
>> >> of action would be to give that version at least a try. It needs to
>> be
>> >> released in the next days in any case.
>> >>
>> >> I have uploaded that (pre-4.5.6) version so that you can give it a
>> try if you
>> >> like:
>> >>
>> >> http://www.rsyslog.com/download/rsyslog/pre/rsyslog-4.5.6.tar.gz
>> >>
>> >> I think it would good if you could try it at least once, because I
>> think any
>> >> other troubleshooting will boil down to looking at the fixes this
>> version
>> >> contains.
>> >>
>> >> Rainer
>> >>
>> >>> -----Original Message-----
>> >>> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
>> >>> bounces at lists.adiscon.com] On Behalf Of Aaron Wiebe
>> >>> Sent: Monday, November 02, 2009 11:52 PM
>> >>> To: rsyslog-users
>> >>> Subject: [rsyslog] Queuing bug (4.5.5)
>> >>>
>> >>> Greetings all,
>> >>>
>> >>> I appear to have run into an issue with 4.5.5 where queues are not
>> >>> being flushed in a timely manner. In this specific case, I have
>> data
>> >>> from last wednesday that is being written to disk in small chunks
>> >>> today since last wednesday. Unfortunately I cannot be too specific
>> in
>> >>> details, but here is what I can see:
>> >>>
>> >>> Using omfile+gzip, there appears to have been a decent burst in
>> >>> traffic sometime last wednesday. The rsyslog instance has grown to
>> >>> 1.8GB of memory and stayed there for a while. I have both main
>> >>> message and action queues defined using fixedarray, and I see no
>> >>> on-disk queues (appears to be entirely in memory).
>> >>>
>> >>> I've got templates writing out to filenames using an hourly
>> timestamp
>> >>> (filenames like: [token]-[host]-YYYYMMDD-HH.txt.gz) In some of
>> those
>> >>> files, all of them less than 5k in size, there are between 5 and 15
>> >>> lines of content, all of them from last wednesday, and within a few
>> >>> seconds of each other. It's almost like there was a significant
>> queue
>> >>> built up, the hour rolled over, and only the first block of lines
>> were
>> >>> pulled from the queue. Then the hour rolled over again, and
>> another
>> >>> block of lines were pulled from the queue. Then the next hour,
>> then
>> >>> another 5-15 lines.
>> >>>
>> >>> Is it possible that one of the queues still has a good chunk of
>> data
>> >>> built up, and is flushing it out very slowly? It hasn't been
>> >>> consistantly at the top of the hour either, and not every hour.
>> But
>> >>> the log lines themselves are sequentially written out, and usually
>> the
>> >>> lines are within a few seconds of each other.
>> >>>
>> >>> For example:
>> >>>
>> >>> syslog-myhost-20091102-18.txt.gz: 3 lines, 2 with TS Oct 21
>> 18:46:34
>> >>> and one 18:46:35
>> >>> syslog-myhost-20091102-19.txt.gz: 17 lines, 3 Oct 21 18:46:35, 14
>> >>> lines Oct 21 18:46:36
>> >>> syslog-myhost-20091102-20.txt.gz: 12 lines, 8 Oct 21 18:46:36, 4
>> >>> lines Oct 21 18:46:37
>> >>>
>> >>> Thoughts?
>> >>> -Aaron
>> >>> _______________________________________________
>> >>> rsyslog mailing list
>> >>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> >>> http://www.rsyslog.com
>> >> _______________________________________________
>> >> rsyslog mailing list
>> >> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> >> http://www.rsyslog.com
>> >>
>> > _______________________________________________
>> > rsyslog mailing list
>> > http://lists.adiscon.net/mailman/listinfo/rsyslog
>> > http://www.rsyslog.com
>> >
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com
>
More information about the rsyslog
mailing list