[rsyslog-notify] Forum Thread: Dequeue Perfomance - (Mode 'edit_topic')
noreply at adiscon.com
noreply at adiscon.com
Sat May 2 05:42:19 CEST 2015
User: kamermans
Forumlink: http://kb.monitorware.com/viewtopic.php?p=25461#p25461
Message:
----------
Hi,
I am using rsyslog on several frontend webservers (aka 'webservers') that
forward requests via UDP to an rsyslog log aggregator (aka 'log-agg'), this
aggregator then forwards all the logs via RELP to our central rsyslog log
collector (aka 'collector').
At peak, log-agg is receiving 3000 req/sec on UDP and forwarding them our
to RELP at the same rate. Most of the time we receive the log entries
within 1 second of them being emitted by the webservers, so all is well.
When there is an interruption between log-agg and the collector, however,
log-agg is forced to spool up messages into its 30GB of RAM and 70GB of
dedicated spool disk space (using LinkedList). After a short network
outage, log-agg starts spooling things like crazy into RAM until it's full,
then down to disk.
The problem, however, is that today's outage only lasted one minute or so,
but it took 5.5 hours to dequeue, and a cursory look at my downstream stats
suggests that some data for about 30 minutes or so is completely gone.
The dequeuing issue is my main concern, so I would like some advice as to
what may be causing this, or how I can tweak the settings to causing
dequeuing to happen as rapidly as possible (btw, my connection is 50Mbps
up/down, so that's not the bottleneck).
Here are some performance graphs from the rsyslog stats module:
This shows log-agg on the left and our collector on the right, note that
this is in messages per 5 minutes, so 600K is actually 2,000/second. The
brief outage was at 11:00am.
[img:2nfqfcgd]http://s24.postimg.org/9oe5qbtc5/stats1.png[/img:2nfqfcgd]
Here you can see the following graphs:
1. Memory usage reported by the system
2. CPU usage reported by the system
3. Memory vs CPU usage reported by rsyslog
[img:2nfqfcgd]http://s14.postimg.org/sv46ejkrl/stats2.png[/img:2nfqfcgd]
And here is the anonymized config:
[code:2nfqfcgd]
# Stats logging must be first!
module( load="impstats"
interval="300"
severity="7"
log.syslog="off"
log.file="/var/log/rsyslog-stats.log")
# End stats logging
# Can't figure out how to set these in RainerScript format
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat
$RepeatedMsgReduction off
$FileOwner syslog
$FileGroup adm
$FileCreateMode 0640
$DirCreateMode 0755
$Umask 0022
$PrivDropToUser syslog
$PrivDropToGroup syslog
$WorkDirectory /var/spool/rsyslog
# Super-verbose, be careful!
# $DebugFile /var/log/rsyslog-debug.log
# $DebugLevel 2
$Ruleset RSYSLOG_DefaultRuleset
auth,authpriv.* /var/log/auth.log
*.*;auth,authpriv.none -/var/log/syslog
kern.* -/var/log/kern.log
mail.* -/var/log/mail.log
*.emerg :omusrmsg:*
# We are doing everything in this one file
# $IncludeConfig /etc/rsyslog.d/*.conf
module( load="imklog"
permitNonKernelFacility="on"
)
module( load="imuxsock")
module( load="imudp"
threads="2"
timeRequery="8"
batchSize="256"
)
module( load="imrelp" )
module( load="omrelp" )
ruleset(name="forwardRelp"
queue.type="LinkedList"
queue.fileName="relp_queue"
queue.spoolDirectory="/var/spool/rsyslog/relp_queue"
queue.maxDiskSpace="70g"
queue.maxFileSize="1g"
queue.size="172800000" # two days at 1000/sec
queue.dequeueBatchSize="512"
queue.saveOnShutdown="on"
queue.workerThreads="6"
queue.workerThreadMinimumMessages="512"
) {
action( type="omrelp"
target="10.2.3.4"
port="15140"
# action.retryCount="-1"
)
# Use this instead of the above to log locally
# action( type="omfile"
# file="/mnt/log_backup/local-backup.log"
# ioBufferSize="64k"
# flushOnTXEnd="off"
# asyncWriting="on"
# )
}
input( type="imuxsock"
socket="/dev/log"
)
# UDP Logging
input( type="imudp"
port="514"
ruleset="forwardRelp"
)
[/code:2nfqfcgd]
Thanks in advance for your help!
[EDIT]
Here are the raw stats for a few minutes before the event at 11:00am EST
(log is in UTC, so 11:00=15:00)
[code:2nfqfcgd]
Fri May 1 15:02:58 2015: action 1: origin=core.action
processed=83840 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 15:02:58 2015: action 2: origin=core.action
processed=5670 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 15:02:58 2015: action 3: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 15:02:58 2015: action 4: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 15:02:58 2015: action 5: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 15:02:58 2015: imuxsock: origin=imuxsock
submitted=89362 ratelimit.discarded=0 ratelimit.numratelimiters=0
Fri May 1 15:02:58 2015: action 6: origin=core.action
processed=2483254939 failed=0 suspended=21 suspended.duration=12870
resumed=0
Fri May 1 15:02:58 2015: imudp(*:514): origin=imudp
submitted=2483254939
Fri May 1 15:02:58 2015: imudp(*:514): origin=imudp
submitted=0
Fri May 1 15:02:58 2015: imrelp[15140]:
origin=imrelp submitted=0
Fri May 1 15:02:58 2015: resource-usage: origin=impstats
utime=92193642986 stime=58774558672 maxrss=11361436 minflt=24629546
majflt=0 inblock=8 oublock=183920 nvcsw=-1791335565 nivcsw=2642
Fri May 1 15:02:58 2015: forwardRelp[DA]:
origin=core.queue size=0 enqueued=0 full=0 discarded.full=0
discarded.nf=0 maxqsize=0
Fri May 1 15:02:58 2015: forwardRelp:
origin=core.queue size=0 enqueued=2483254939 full=0
discarded.full=0 discarded.nf=0 maxqsize=5388178
Fri May 1 15:02:58 2015: main Q: origin=core.queue
size=0 enqueued=89510 full=0 discarded.full=0 discarded.nf=0
maxqsize=4
Fri May 1 15:02:58 2015: imudp(w1): origin=imudp
called.recvmmsg=2096640422 called.recvmsg=0
msgs.received=462058817
Fri May 1 15:02:58 2015: imudp(w0): origin=imudp
called.recvmmsg=4363725842 called.recvmsg=0
msgs.received=2021196145
Fri May 1 15:07:58 2015: action 1: origin=core.action
processed=83840 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 15:07:58 2015: action 2: origin=core.action
processed=5673 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 15:07:58 2015: action 3: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 15:07:58 2015: action 4: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 15:07:58 2015: action 5: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 15:07:58 2015: imuxsock: origin=imuxsock
submitted=89365 ratelimit.discarded=0 ratelimit.numratelimiters=0
Fri May 1 15:07:58 2015: action 6: origin=core.action
processed=2483474026 failed=0 suspended=21 suspended.duration=12870
resumed=0
Fri May 1 15:07:58 2015: imudp(*:514): origin=imudp
submitted=2483834606
Fri May 1 15:07:58 2015: imudp(*:514): origin=imudp
submitted=0
Fri May 1 15:07:58 2015: imrelp[15140]:
origin=imrelp submitted=0
Fri May 1 15:07:58 2015: resource-usage: origin=impstats
utime=92212505472 stime=58779800388 maxrss=11361436 minflt=24629583
majflt=0 inblock=8 oublock=183944 nvcsw=-1790086037 nivcsw=2642
Fri May 1 15:07:58 2015: forwardRelp[DA]:
origin=core.queue size=0 enqueued=0 full=0 discarded.full=0
discarded.nf=0 maxqsize=0
Fri May 1 15:07:58 2015: forwardRelp:
origin=core.queue size=363652 enqueued=2483834606 full=0
discarded.full=0 discarded.nf=0 maxqsize=5388178
Fri May 1 15:07:58 2015: main Q: origin=core.queue
size=0 enqueued=89513 full=0 discarded.full=0 discarded.nf=0
maxqsize=4
Fri May 1 15:07:58 2015: imudp(w1): origin=imudp
called.recvmmsg=2097207863 called.recvmsg=0
msgs.received=462170911
Fri May 1 15:07:58 2015: imudp(w0): origin=imudp
called.recvmmsg=4364742305 called.recvmsg=0
msgs.received=2021663718
[/code:2nfqfcgd]
And a few minutes before the queue cleared around 16:00 (20:00 in GMT on
this log):
[code:2nfqfcgd]
Fri May 1 20:04:23 2015: action 1: origin=core.action
processed=108 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 20:04:23 2015: action 2: origin=core.action
processed=14 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 20:04:23 2015: action 3: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 20:04:23 2015: action 4: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 20:04:23 2015: action 5: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 20:04:23 2015: imuxsock: origin=imuxsock
submitted=118 ratelimit.discarded=0 ratelimit.numratelimiters=0
Fri May 1 20:04:23 2015: action 6: origin=core.action
processed=7281348 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 20:04:23 2015: imudp(*:514): origin=imudp
submitted=9313380
Fri May 1 20:04:23 2015: imudp(*:514): origin=imudp
submitted=0
Fri May 1 20:04:23 2015: imrelp[15140]:
origin=imrelp submitted=0
Fri May 1 20:04:23 2015: resource-usage: origin=impstats
utime=258079842 stime=114930910 maxrss=9113612 minflt=8153 majflt=0
inblock=544 oublock=288 nvcsw=18776821 nivcsw=1098
Fri May 1 20:04:23 2015: forwardRelp[DA]:
origin=core.queue size=0 enqueued=0 full=0 discarded.full=0
discarded.nf=0 maxqsize=0
Fri May 1 20:04:23 2015: forwardRelp:
origin=core.queue size=2035104 enqueued=9313380 full=0
discarded.full=0 discarded.nf=0 maxqsize=5703695
Fri May 1 20:04:23 2015: main Q: origin=core.queue
size=0 enqueued=122 full=0 discarded.full=0 discarded.nf=0
maxqsize=3
Fri May 1 20:04:23 2015: imudp(w1): origin=imudp
called.recvmmsg=9893888 called.recvmsg=0 msgs.received=2060670
Fri May 1 20:04:23 2015: imudp(w0): origin=imudp
called.recvmmsg=16106595 called.recvmsg=0 msgs.received=7252710
Fri May 1 20:14:23 2015: action 1: origin=core.action
processed=108 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 20:14:23 2015: action 2: origin=core.action
processed=14 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 20:14:23 2015: action 3: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 20:14:23 2015: action 4: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 20:14:23 2015: action 5: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 20:14:23 2015: imuxsock: origin=imuxsock
submitted=118 ratelimit.discarded=0 ratelimit.numratelimiters=0
Fri May 1 20:14:23 2015: action 6: origin=core.action
processed=10316796 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri May 1 20:14:23 2015: imudp(*:514): origin=imudp
submitted=10316870
Fri May 1 20:14:23 2015: imudp(*:514): origin=imudp
submitted=0
Fri May 1 20:14:23 2015: imrelp[15140]:
origin=imrelp submitted=0
Fri May 1 20:14:23 2015: resource-usage: origin=impstats
utime=304349575 stime=122138106 maxrss=9113612 minflt=10804 majflt=0
inblock=544 oublock=304 nvcsw=20893782 nivcsw=1424
Fri May 1 20:14:23 2015: forwardRelp[DA]:
origin=core.queue size=0 enqueued=0 full=0 discarded.full=0
discarded.nf=0 maxqsize=0
Fri May 1 20:14:23 2015: forwardRelp:
origin=core.queue size=981 enqueued=10316870 full=0
discarded.full=0 discarded.nf=0 maxqsize=5703695
Fri May 1 20:14:23 2015: main Q: origin=core.queue
size=0 enqueued=122 full=0 discarded.full=0 discarded.nf=0
maxqsize=3
Fri May 1 20:14:23 2015: imudp(w1): origin=imudp
called.recvmmsg=11087470 called.recvmsg=0 msgs.received=2395452
Fri May 1 20:14:23 2015: imudp(w0): origin=imudp
called.recvmmsg=17701091 called.recvmsg=0 msgs.received=7921418
[/code:2nfqfcgd]
More information about the rsyslog-notify
mailing list