[rsyslog-notify] Forum Thread: Failed RELP action discards messages - (Mode 'post')

noreply at adiscon.com
Tue Aug 23 17:48:38 CEST 2016


User: kamermans 
Forumlink: http://kb.monitorware.com/viewtopic.php?p=26906#p26906

Message: 
----------
I've got a log aggregator that is receiving messages via UDP and forwarding
them to another log server via RELP.

Yesterday, we had a connection problem that lasted 26 seconds.  Since then,
*all messages* going through this action have "failed" (as shown by
impstats).  I don't understand why the messages weren't simply left in the
queue.  I use a LinkedList queue with a disk-assisted overflow that has
been known to work fine in the past.

Here is the config that sets up the queue in question and the action that
is failing:
[code:5eibbpw7]
ruleset(name="forwardPartition1"
        queue.type="LinkedList"

        # The number of messages that can be stored in RAM
        # Theoretical MAX: (15GB RAM / 1.5k msg) == 10,000,000
        # System has crashed with 1 queue at 7.8M
        # Keep this UNDER 6M
        queue.size="1000000"
        queue.dequeueBatchSize="750"
        queue.saveOnShutdown="on"
        queue.workerThreads="1"

        # Setting these enables Disk-Assisted mode
        queue.fileName="fp1_queue"
        queue.spoolDirectory="/var/spool/rsyslog/fp1_part1"
        # The total amount of disk space that can be used to store
        #  messages that overflow the memory-based LinkedList queue
        # NOTE: queue.size has no effect on the disk-assisted storage
        queue.maxDiskSpace="35g"
        queue.maxFileSize="1g"
) {
        action( type="omrelp"
                name="fp1_omrelp"
                target="<log server ip address>"
                port="<port>"
                windowSize="1024"
        )

}
[/code:5eibbpw7]
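One thing I wonder about is the action-level retry parameters, which I do not set at all.  As far as I understand the docs, with the default action.resumeRetryCount="0" a message whose retry fails is discarded rather than requeued.  Here is a sketch of what I might try (parameter values are illustrative, untested):

```
action( type="omrelp"
        name="fp1_omrelp"
        target="<log server ip address>"
        port="<port>"
        windowSize="1024"
        # Retry forever instead of discarding messages after a failed retry
        action.resumeRetryCount="-1"
        # Seconds between resume probes (the documented default is 30)
        action.resumeInterval="10"
)
```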

Here you can see that at 01:05:43, impstats logged that action fp1_omrelp
was suspended (the suspended counter went from 174 to 175), and after that,
all messages failed.  Also of interest: the action had already been
suspended 174 times before this and resumed 0 times - I'm not sure what's
going on there.

After 01:05:43, no messages went out of that action for several hours.  We
rerouted the incoming data to another queue (called forwardPartition2/fp2)
with exactly the same action, and everything is working fine there, but the
queue in question is still stuck.

Here is the last impstats snapshot from before the event:
[code:5eibbpw7]
Tue Aug 23 01:05:13 2016: global: origin=dynstats 
Tue Aug 23 01:05:13 2016: imuxsock: origin=imuxsock
submitted=255937 ratelimit.discarded=0 ratelimit.numratelimiters=0 
Tue Aug 23 01:05:13 2016: fp1_omrelp:
origin=core.action processed=8600289257 failed=29559121 suspended=174
suspended.duration=11640 resumed=0 
Tue Aug 23 01:05:13 2016: fp2_omrelp:
origin=core.action processed=0 failed=0 suspended=0
suspended.duration=0 resumed=0 
Tue Aug 23 01:05:13 2016: action 2: origin=core.action
processed=229874 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 01:05:13 2016: action 3: origin=core.action
processed=26792 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 01:05:13 2016: action 4: origin=core.action
processed=4 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 01:05:13 2016: action 5: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 01:05:13 2016: action 6: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 01:05:13 2016: imudp(*:514): origin=imudp
submitted=8600326654 
Tue Aug 23 01:05:13 2016: imudp(*:514): origin=imudp
submitted=0 
Tue Aug 23 01:05:13 2016: imudp(*:515): origin=imudp
submitted=0 
Tue Aug 23 01:05:13 2016: imudp(*:515): origin=imudp
submitted=0 
Tue Aug 23 01:05:13 2016: resource-usage: origin=impstats
utime=442314057420 stime=143753103122 maxrss=1373072 minflt=9045577
majflt=0 inblock=0 oublock=6363432 nvcsw=-485201193 nivcsw=1089403 
Tue Aug 23 01:05:13 2016: forwardPartition1[DA]:
origin=core.queue size=0 enqueued=813000 full=0 discarded.full=0
discarded.nf=0 maxqsize=203250 
Tue Aug 23 01:05:13 2016: forwardPartition1:
origin=core.queue size=38147 enqueued=8600326654 full=0
discarded.full=0 discarded.nf=0 maxqsize=900099 
Tue Aug 23 01:05:13 2016: forwardPartition2[DA]:
origin=core.queue size=0 enqueued=0 full=0 discarded.full=0
discarded.nf=0 maxqsize=0 
Tue Aug 23 01:05:13 2016: forwardPartition2:
origin=core.queue size=0 enqueued=0 full=0 discarded.full=0
discarded.nf=0 maxqsize=0 
Tue Aug 23 01:05:13 2016: main Q: origin=core.queue
size=0 enqueued=256666 full=0 discarded.full=0 discarded.nf=0
maxqsize=6 
Tue Aug 23 01:05:13 2016: imudp(w1): origin=imudp
called.recvmmsg=7349375882 called.recvmsg=0
msgs.received=1629754953 
Tue Aug 23 01:05:13 2016: imudp(w0): origin=imudp
called.recvmmsg=15070271528 called.recvmsg=0
msgs.received=6970571701
[/code:5eibbpw7]

And the first one after the event:
[code:5eibbpw7]
Tue Aug 23 01:05:43 2016: global: origin=dynstats 
Tue Aug 23 01:05:43 2016: imuxsock: origin=imuxsock
submitted=255937 ratelimit.discarded=0 ratelimit.numratelimiters=0 
Tue Aug 23 01:05:43 2016: fp1_omrelp:
origin=core.action processed=8600343509 failed=29613373 suspended=175
suspended.duration=11670 resumed=0 
Tue Aug 23 01:05:43 2016: fp2_omrelp:
origin=core.action processed=0 failed=0 suspended=0
suspended.duration=0 resumed=0 
Tue Aug 23 01:05:43 2016: action 2: origin=core.action
processed=229874 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 01:05:43 2016: action 3: origin=core.action
processed=26793 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 01:05:43 2016: action 4: origin=core.action
processed=4 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 01:05:43 2016: action 5: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 01:05:43 2016: action 6: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 01:05:43 2016: imudp(*:514): origin=imudp
submitted=8600343509 
Tue Aug 23 01:05:43 2016: imudp(*:514): origin=imudp
submitted=0 
Tue Aug 23 01:05:43 2016: imudp(*:515): origin=imudp
submitted=0 
Tue Aug 23 01:05:43 2016: imudp(*:515): origin=imudp
submitted=0 
Tue Aug 23 01:05:43 2016: resource-usage: origin=impstats
utime=442314057420 stime=143753677847 maxrss=1373072 minflt=9051520
majflt=0 inblock=0 oublock=6363440 nvcsw=-485167564 nivcsw=1089405 
Tue Aug 23 01:05:43 2016: forwardPartition1[DA]:
origin=core.queue size=0 enqueued=813000 full=0 discarded.full=0
discarded.nf=0 maxqsize=203250 
Tue Aug 23 01:05:43 2016: forwardPartition1:
origin=core.queue size=0 enqueued=8600343509 full=0
discarded.full=0 discarded.nf=0 maxqsize=900099 
Tue Aug 23 01:05:43 2016: forwardPartition2[DA]:
origin=core.queue size=0 enqueued=0 full=0 discarded.full=0
discarded.nf=0 maxqsize=0 
Tue Aug 23 01:05:43 2016: forwardPartition2:
origin=core.queue size=0 enqueued=0 full=0 discarded.full=0
discarded.nf=0 maxqsize=0 
Tue Aug 23 01:05:43 2016: main Q: origin=core.queue
size=0 enqueued=256667 full=0 discarded.full=0 discarded.nf=0
maxqsize=6 
Tue Aug 23 01:05:43 2016: imudp(w1): origin=imudp
called.recvmmsg=7349390855 called.recvmsg=0
msgs.received=1629756511 
Tue Aug 23 01:05:43 2016: imudp(w0): origin=imudp
called.recvmmsg=15070303032 called.recvmsg=0
msgs.received=6970586998 
[/code:5eibbpw7]

And here is another one several hours later:
[code:5eibbpw7]
Tue Aug 23 09:48:14 2016: global: origin=dynstats 
Tue Aug 23 09:48:14 2016: imuxsock: origin=imuxsock
submitted=256292 ratelimit.discarded=0 ratelimit.numratelimiters=0 
Tue Aug 23 09:48:14 2016: fp1_omrelp:
origin=core.action processed=8618412814 failed=47682678 suspended=175
suspended.duration=43050 resumed=0 
Tue Aug 23 09:48:14 2016: fp2_omrelp:
origin=core.action processed=0 failed=0 suspended=0
suspended.duration=0 resumed=0 
Tue Aug 23 09:48:14 2016: action 2: origin=core.action
processed=230159 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 09:48:14 2016: action 3: origin=core.action
processed=26864 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 09:48:14 2016: action 4: origin=core.action
processed=4 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 09:48:14 2016: action 5: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 09:48:14 2016: action 6: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0 
Tue Aug 23 09:48:14 2016: imudp(*:514): origin=imudp
submitted=8618412814 
Tue Aug 23 09:48:14 2016: imudp(*:514): origin=imudp
submitted=0 
Tue Aug 23 09:48:14 2016: imudp(*:515): origin=imudp
submitted=0 
Tue Aug 23 09:48:14 2016: imudp(*:515): origin=imudp
submitted=0 
Tue Aug 23 09:48:14 2016: resource-usage: origin=impstats
utime=443041670618 stime=144109084662 maxrss=1373072 minflt=9052255
majflt=0 inblock=0 oublock=6374824 nvcsw=-431732183 nivcsw=1090540 
Tue Aug 23 09:48:14 2016: forwardPartition1[DA]:
origin=core.queue size=0 enqueued=813000 full=0 discarded.full=0
discarded.nf=0 maxqsize=203250 
Tue Aug 23 09:48:14 2016: forwardPartition1:
origin=core.queue size=0 enqueued=8618412814 full=0
discarded.full=0 discarded.nf=0 maxqsize=900099 
Tue Aug 23 09:48:14 2016: forwardPartition2[DA]:
origin=core.queue size=0 enqueued=0 full=0 discarded.full=0
discarded.nf=0 maxqsize=0 
Tue Aug 23 09:48:14 2016: forwardPartition2:
origin=core.queue size=0 enqueued=0 full=0 discarded.full=0
discarded.nf=0 maxqsize=0 
Tue Aug 23 09:48:14 2016: main Q: origin=core.queue
size=0 enqueued=257023 full=0 discarded.full=0 discarded.nf=0
maxqsize=6 
Tue Aug 23 09:48:14 2016: imudp(w1): origin=imudp
called.recvmmsg=7361925011 called.recvmsg=0
msgs.received=1631632936 
Tue Aug 23 09:48:14 2016: imudp(w0): origin=imudp
called.recvmmsg=15103866561 called.recvmsg=0
msgs.received=6986779878 
[/code:5eibbpw7]

So, between 01:05 and 09:48, 18,123,557 messages failed in action
fp1_omrelp, yet the queue feeding it, forwardPartition1, maintained a size
of 0, as did its disk-assisted counterpart, forwardPartition1[DA].  We
also graph everything in impstats, so I can pull historical data on any
field here if it interests you.
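In case anyone wants to reproduce that delta: here is the quick one-liner I use to pull the failed= counter out of impstats output and diff it.  The two sample lines are copied from the first and last snapshots above (joined onto single lines):

```shell
# Two fp1_omrelp lines copied from the impstats snapshots above
cat > /tmp/impstats_sample.txt <<'EOF'
Tue Aug 23 01:05:13 2016: fp1_omrelp: origin=core.action processed=8600289257 failed=29559121 suspended=174 suspended.duration=11640 resumed=0
Tue Aug 23 09:48:14 2016: fp1_omrelp: origin=core.action processed=8618412814 failed=47682678 suspended=175 suspended.duration=43050 resumed=0
EOF

# Pull the failed= counter from each fp1_omrelp line and print last-first
awk '/fp1_omrelp/ {
        for (i = 1; i <= NF; i++)
                if ($i ~ /^failed=/) { split($i, a, "="); v[++n] = a[2] }
} END { print v[n] - v[1] }' /tmp/impstats_sample.txt
# prints 18123557
```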

Here is my rsyslog version:
[code:5eibbpw7]
rsyslogd 8.17.0, compiled with:
	PLATFORM:				x86_64-pc-linux-gnu
	PLATFORM (lsb_release -d):		
	FEATURE_REGEXP:				Yes
	GSSAPI Kerberos 5 support:		No
	FEATURE_DEBUG (debug build, slow code):	No
	32bit Atomic operations supported:	Yes
	64bit Atomic operations supported:	Yes
	memory allocator:			system default
	Runtime Instrumentation (slow code):	No
	uuid support:				Yes
	Number of Bits in RainerScript integers: 64
[/code:5eibbpw7]

My main questions are:
 1. If a message fails to send via RELP, shouldn't it be left in the
queue?  (My data shows the messages are discarded.)
 2. When an action is suspended, what can cause it to resume?  The error
condition is clearly gone, since another identical action on my server, in
the same instance of rsyslog, is happily sending messages via RELP to the
same destination.  Oh, how I wish I could send a signal like SIGUSR1 to
tell rsyslog to immediately try to resume all suspended actions.  We have
actually been in this scenario many times over the last few years: rsyslog
suspends an action, we clear the error condition, then sit around waiting,
wondering whether to restart rsyslog and lose all our data or keep waiting
as new data piles up on top of the old while the action never resumes.
When the action does eventually resume on its own, it is usually several
hours later, which is highly inconvenient.

For the time being, rsyslog has not been restarted - I have diverted all
new traffic to the fp2 queue so I can free up fp1, in case by some freak
chance it has retained my data and can resume processing.
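In case it helps anyone in the same spot: before deciding whether to restart, I check whether the disk-assisted queue has actually left anything in the spool.  The directory and file prefix here are the ones from my config above (queue.spoolDirectory / queue.fileName):

```shell
# Look for disk-assisted queue files for forwardPartition1.
ls -lh /var/spool/rsyslog/fp1_part1/fp1_queue* 2>/dev/null \
        || echo "no spool files on disk"
```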

