[rsyslog-notify] Forum Thread: Re: Dequeue Performance - (Mode 'reply')

noreply at adiscon.com
Sat May 2 23:51:33 CEST 2015


User: kamermans 
Forumlink: http://kb.monitorware.com/viewtopic.php?p=25463#p25463

Message: 
----------
Thanks for your response.

I think I started increasing the worker threads in response to this same
problem in the past, to try to get dequeuing to improve, but what you say
makes sense, and I don't mind decreasing it back to 4 or something, since
raising it doesn't seem to have improved anything (see the config sketch below).
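
For reference, something like this is what I mean - a minimal sketch of the
forwarding action, assuming omrelp since my queue is named forwardRelp; the
target host and port are placeholders, not my real config:

[code]
# Hypothetical forwarding action with the worker count dropped back to 4.
# Target and port are placeholders; the queue.* knobs are the point here.
action(type="omrelp" target="aggregator.example.com" port="2514"
       queue.type="LinkedList"
       queue.workerThreads="4"          # back down from the higher count
       queue.dequeueBatchSize="1024")   # let each worker grab bigger batches
[/code]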

My initial evidence for lost messages was in a downstream log processor
(Splunk in this case):


[img]http://s7.postimg.org/gsk4gkt2z/stats3.png[/img]

Last night I noticed something strange that might be relevant.  My stats
output from before the outage showed about 2.4 billion messages submitted
through imudp, yet at the end it showed only about 10 million.  Since I had
not restarted rsyslog, I was confused about how this counter could decrease,
so I went and found the first instance of the drop in the log:

[code]
Fri May  1 16:43:00 2015: action 1: origin=core.action
processed=83956 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May  1 16:43:00 2015: action 2: origin=core.action
processed=5684 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May  1 16:43:00 2015: action 3: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May  1 16:43:00 2015: action 4: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May  1 16:43:00 2015: action 5: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May  1 16:43:00 2015: imuxsock: origin=imuxsock
submitted=89492 ratelimit.discarded=0 ratelimit.numratelimiters=0
Fri May  1 16:43:00 2015: action 6: origin=core.action
processed=2486364778 failed=0 suspended=21 suspended.duration=12870
resumed=0
Fri May  1 16:43:00 2015: imudp(*:514): origin=imudp
submitted=2494236327
Fri May  1 16:43:00 2015: imudp(*:514): origin=imudp
submitted=0
Fri May  1 16:43:00 2015: imrelp[15140]:
origin=imrelp submitted=0
Fri May  1 16:43:00 2015: resource-usage: origin=impstats
utime=92508274650 stime=58877886701 maxrss=14996260 minflt=26979831
majflt=0 inblock=8 oublock=184288 nvcsw=-1769164642 nivcsw=2652
Fri May  1 16:43:00 2015: forwardRelp[DA]:
origin=core.queue size=0 enqueued=0 full=0 discarded.full=0
discarded.nf=0 maxqsize=0
Fri May  1 16:43:00 2015: forwardRelp:
origin=core.queue size=7874621 enqueued=2494236327 full=0
discarded.full=0 discarded.nf=0 maxqsize=7874621
Fri May  1 16:43:00 2015: main Q: origin=core.queue
size=0 enqueued=89640 full=0 discarded.full=0 discarded.nf=0
maxqsize=4
Fri May  1 16:43:00 2015: imudp(w1): origin=imudp
called.recvmmsg=2108182749 called.recvmsg=0
msgs.received=464489044
Fri May  1 16:43:00 2015: imudp(w0): origin=imudp
called.recvmmsg=4382661246 called.recvmsg=0
msgs.received=2029747306

Fri May  1 16:48:43 2015: action 1: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May  1 16:48:43 2015: action 2: origin=core.action
processed=7 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May  1 16:48:43 2015: action 3: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May  1 16:48:43 2015: action 4: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May  1 16:48:43 2015: action 5: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May  1 16:48:43 2015: imuxsock: origin=imuxsock
submitted=3 ratelimit.discarded=0 ratelimit.numratelimiters=0
Fri May  1 16:48:43 2015: action 6: origin=core.action
processed=157582 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May  1 16:48:43 2015: imudp(*:514): origin=imudp
submitted=526464
Fri May  1 16:48:43 2015: imudp(*:514): origin=imudp
submitted=0
Fri May  1 16:48:43 2015: imrelp[15140]:
origin=imrelp submitted=0
Fri May  1 16:48:43 2015: resource-usage: origin=impstats
utime=12137128 stime=7377470 maxrss=643968 minflt=729 majflt=1 inblock=320
oublock=24 nvcsw=1060331 nivcsw=53
Fri May  1 16:48:43 2015: forwardRelp[DA]:
origin=core.queue size=0 enqueued=0 full=0 discarded.full=0
discarded.nf=0 maxqsize=0
Fri May  1 16:48:43 2015: forwardRelp:
origin=core.queue size=371954 enqueued=526464 full=0
discarded.full=0 discarded.nf=0 maxqsize=371954
Fri May  1 16:48:43 2015: main Q: origin=core.queue
size=0 enqueued=7 full=0 discarded.full=0 discarded.nf=0 maxqsize=2
Fri May  1 16:48:43 2015: imudp(w1): origin=imudp
called.recvmmsg=536399 called.recvmsg=0 msgs.received=100969
Fri May  1 16:48:43 2015: imudp(w0): origin=imudp
called.recvmmsg=925064 called.recvmsg=0 msgs.received=425495
[/code]

As you can see in the 16:43 entry, imudp had submitted 2494236327 messages
and the forwardRelp queue had enqueued the same 2494236327, so nothing was
being lost on the way into the queue; action 6 had processed 2486364778 of
those, with roughly 7.9 million (size=7874621) still waiting to be dequeued.
It must have been running for a few weeks for me to accumulate 2.4 billion
messages :)

Five minutes later, at 16:48, imudp was at 526464 and so was the forwardRelp
enqueued counter - still matching, but something had reset the counters.

I wonder what is happening here and if messages are being lost in this
process.  I'll write a script to parse my pstats log and see how often the
stats were reset and if there was ever any record of discarded messages,
etc.
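
Something like this quick sketch is what I have in mind - it keys off the
impstats fields shown above, but the log path, the exact line layout, and
the list of counters to watch are assumptions about my own setup:

[code]
#!/usr/bin/env python3
# Scan an impstats log for counter resets and nonzero discard counters.
# Expects one stat record per line, e.g.:
#   Fri May  1 16:43:00 2015: forwardRelp: origin=core.queue size=... enqueued=...
import re
import sys

LINE_RE = re.compile(
    r'^(?P<ts>\w{3} \w{3} +\d+ [\d:]+ \d{4}): (?P<name>.+?): origin=\S+(?P<fields>.*)$')
FIELD_RE = re.compile(r'(?P<key>[\w.]+)=(?P<val>-?\d+)')

# Cumulative counters that should only ever grow between restarts.
MONOTONIC = {'enqueued', 'submitted', 'processed', 'msgs.received'}

def scan(path):
    last = {}  # (stat name, counter) -> last value seen
    with open(path) as fh:
        for line in fh:
            m = LINE_RE.match(line.rstrip())
            if not m:
                continue
            ts, name = m.group('ts'), m.group('name')
            for f in FIELD_RE.finditer(m.group('fields')):
                key, val = f.group('key'), int(f.group('val'))
                if key in MONOTONIC:
                    prev = last.get((name, key))
                    if prev is not None and val < prev:
                        # Caveat: two listeners sharing a name (like my
                        # duplicate imudp(*:514) lines) will look like resets.
                        print(f'{ts}: RESET?  {name} {key}: {prev} -> {val}')
                    last[(name, key)] = val
                elif 'discarded' in key and val > 0:
                    print(f'{ts}: DISCARD {name} {key}={val}')

if __name__ == '__main__':
    scan(sys.argv[1] if len(sys.argv) > 1 else '/var/log/rsyslog-stats.log')
[/code]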

Meanwhile, I'm still curious about what is slowing down the dequeuing
process.  I am about to spin up a new log server with 128GB of RAM and a
RAID-SSD cache for spooling, in case it comes to that.
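
If I do build that box, I assume the spool side would look roughly like
this - a sketch of a disk-assisted queue on the forwarding action, with
placeholder paths, sizes, and target:

[code]
# Sketch of a disk-assisted (DA) queue spooling to the SSD array.
# Paths, sizes, and the RELP target are placeholders, not real values.
action(type="omrelp" target="aggregator.example.com" port="2514"
       queue.type="LinkedList"
       queue.filename="fwdRelp"                  # a filename enables DA mode
       queue.spoolDirectory="/var/spool/rsyslog" # point this at the SSD RAID
       queue.maxDiskSpace="64g"
       queue.highWatermark="500000"              # start spilling to disk here
       queue.lowWatermark="100000"               # stop spilling below this
       queue.saveOnShutdown="on")
[/code]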

Right now the only real option that stands out to me is to add another
log aggregator, put them both behind a load balancer, and keep only one in
service until its queue starts filling up, then switch to the standby
aggregator to let the primary do its dequeuing in peace.

