[rsyslog-notify] Forum Thread: Rsyslog server recv-Q hits max, impacts endpoint - (Mode 'post')

Wed Mar 29 03:36:47 CEST 2017

User: leonidas 
Forumlink: http://kb.monitorware.com/viewtopic.php?p=27231#p27231

Message: 
----------
Good day, everyone.  We are experiencing a random, but reproducible problem
on our Rsyslog server.  It goes like this:

1)  We stop/start the rsyslog service on the rsyslog server.

2)  At first, we see a number of connections from the endpoints in various
states, and many have some data in their recv-Qs.  These queues are not
very big, and within a minute or two all connections are established and
the recv-Qs go to zero.  (Note that we are using a keepalive technique on
the imtcp module to solve another, previous problem with multiple sessions. 
We got the fix idea from the KB.)

3)  One, sometimes two endpoints behave differently: the recv-Q for each of
these devices on the Rsyslog server just goes up and up until it will go no
further.  I've seen it as high as 300000, sometimes more.  I apologize for
the stupid question, but what is the default recv-Q max value for the
Rsyslog application, as opposed to the SO_RCVBUF on the OS?

4)  By that time, the send-Q on the endpoint maxes out and the endpoint
begins having all sorts of problems (e.g., very slow response for ssh
logins).
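
For reference on the keepalive and SO_RCVBUF points above, here is a
minimal Python sketch of how those per-socket settings can be read back
from the OS.  It is only an illustration on a fresh local socket; it says
nothing about how rsyslog's imtcp sets them internally.  The helper name
describe_socket_options is made up for this example, and the TCP_KEEP*
options are assumed to exist only on platforms that define them (e.g.
Linux).  As far as we understand it, the Recv-Q shown by netstat/ss is the
number of bytes sitting in the kernel's socket receive buffer that the
application has not read yet, so its ceiling comes from the socket's
receive buffer rather than from a separate application-level queue.

```python
import socket


def describe_socket_options(sock):
    """Return the keepalive and receive-buffer settings of a TCP socket."""
    info = {
        # True when the kernel will send keepalive probes on this socket.
        "keepalive": bool(
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
        ),
        # Receive buffer size in bytes as reported by the kernel.
        "rcvbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    }
    # Keepalive timing options are platform-specific; skip any that this
    # Python build does not expose.
    for name in ("TCP_KEEPIDLE", "TCP_KEEPINTVL", "TCP_KEEPCNT"):
        opt = getattr(socket, name, None)
        if opt is not None:
            info[name.lower()] = sock.getsockopt(socket.IPPROTO_TCP, opt)
    return info


# Illustration only: a fresh, unconnected socket with keepalive turned on.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
settings = describe_socket_options(sock)
sock.close()
print(settings)
```

The printed values depend on the OS defaults of the machine it runs on.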

In analyzing this, we have found that

1)  the recv-Q growth on the rsyslog server doesn't depend on what the OS
of the endpoint is.  We've seen it from several different OSes.

2)  When we restart rsyslog service on this same rsyslog server again, we
see the same behaviour but now on different endpoints.  It seems to be a
random occurrence of different endpoints.

3)  We see this same behaviour on two different rsyslog servers. 
Configurations are essentially identical.

4)  Our traffic on the server mentioned here is approximately 40K
messages per 5-minute impstats cycle.

5)  The traffic is 2/3 UDP and 1/3 TCP.

6)  We did a Wireshark capture on the rsyslog server looking at one of the
endpoints whose recv-Q on the rsyslog server grew large and then stopped. 
We saw that the rsyslog server's TCP window facing the endpoint was 0.  We
don't think this was a half-open connection because we saw the keepalives
in the Wireshark capture.
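
To make it easier to see which endpoints are affected after each restart,
the check can be scripted.  The following is a rough Python sketch that
scans text in the column layout printed by "ss -tn" (State, Recv-Q, Send-Q,
local address, peer address) and lists peers whose Recv-Q is at or above a
threshold.  The function name find_backlogged_peers, the 100000-byte
threshold, and the sample addresses are all made up for illustration; they
are not taken from our servers.

```python
def find_backlogged_peers(ss_output, threshold=100_000):
    """Return (peer, recv_q) pairs whose Recv-Q is >= threshold bytes."""
    backlogged = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Data rows look like: State Recv-Q Send-Q Local:Port Peer:Port
        # The header row and malformed rows fail the digit check below.
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        recv_q = int(fields[1])
        if recv_q >= threshold:
            backlogged.append((fields[4], recv_q))
    return backlogged


# Hypothetical sample in the "ss -tn" layout (documentation addresses).
SAMPLE = """\
State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
ESTAB  0       0       192.0.2.10:514      192.0.2.21:41022
ESTAB  301248  0       192.0.2.10:514      192.0.2.37:53810
"""

stuck = find_backlogged_peers(SAMPLE)
print(stuck)
```

In practice the text would come from running ss on the rsyslog server;
running it a few times in a row would show whether a given Recv-Q is
draining or only growing.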

Our supposition is that, at restart of the rsyslog service, something
happened to the connection with the endpoint that caused rsyslog to stop
taking messages off the recv-Q.  Or, if it did take them, it was not fast
enough to keep the recv-Q from filling up.  We never saw the recv-Q value
go down during the whole time it grew to its maximum size.
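
The general TCP effect we suspect can be reproduced outside rsyslog.  The
Python sketch below is only a loopback illustration, not a model of
rsyslog itself: one side accepts a connection and never reads from it,
while the other side keeps sending on a non-blocking socket until the
kernel refuses more data.  At that point the reader's receive buffer and
the sender's send buffer are both full, which matches the recv-Q/send-Q
picture described above.  The function name fill_until_blocked and its
parameters are hypothetical.

```python
import socket


def fill_until_blocked(chunk_size=4096, max_bytes=64 * 1024 * 1024):
    """Send to a peer that never reads; return (bytes_sent, blocked)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free loopback port
    listener.listen(1)

    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()  # accepted, but never read from

    sender.setblocking(False)
    chunk = b"x" * chunk_size
    sent = 0
    blocked = False
    try:
        while sent < max_bytes:
            try:
                sent += sender.send(chunk)
            except BlockingIOError:
                # The kernel will not queue more data: the peer's receive
                # buffer and our send buffer have no room left.
                blocked = True
                break
    finally:
        sender.close()
        receiver.close()
        listener.close()
    return sent, blocked


sent, blocked = fill_until_blocked()
print(f"queued {sent} bytes before the sender blocked: {blocked}")
```

The number of bytes queued depends on the kernel's buffer settings, so it
will differ from machine to machine.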

So, the questions are:

1)  What is causing the endpoint's recv-Q on the rsyslog server to just max
out and hang?

2)  Why does this only happen to 1 or 2 endpoints out of 100+ endpoints?  

3)  If anyone has seen this behaviour, do you have any suggestions on what
you did to fix it?

Thank you in advance for any information or enlightenment you can share.

Best Regards,

L