From david at lang.hm Sat Aug 1 06:53:57 2009
From: david at lang.hm (david at lang.hm)
Date: Fri, 31 Jul 2009 21:53:57 -0700 (PDT)
Subject: [rsyslog] more 5.1.3 errors
Message-ID:

I have the following in the config file:

$template raw,"%rawmsg%\n%fromhost% %hostname% %syslogtag%\n\n\n"
if $fromhost == '192.168.210.216' then /var/log/scribe1a-p;raw
if $fromhost == '192.168.210.217' then /var/log/scribe1a-b;raw
if $fromhost == '192.168.210.219' then /var/log/scribe1b-p;raw
if $fromhost == '192.168.210.220' then /var/log/scribe1b-b;raw
if $fromhost == '192.168.210.222' then /var/log/scribe1c-p;raw
if $fromhost == '192.168.210.223' then /var/log/scribe1c-b;raw
if $fromhost == '192.168.210.245' then /var/log/scribe1d-p;raw

but if I do a tail of these files I get very weird results.

I have some logs in the wrong files, and I have some of them where the fromhost is in the hostname (and the hostname is in the syslogtag).

the second error seems fairly consistent with a given source; unfortunately the worst offender is another rsyslog 5.1.3 box.

this first example shows the scribe1b boxes with the incorrect hostname and syslog tag (scribe1b is the other rsyslog box, the one showing the problem)

# tail scribe1*
==> scribe1a-b <==
<22>Jul 31 21:39:21 192.168.242.126 smelter v0.88.5[23535]: n714dL7N010869: unable to open S/MIME certificate '/var/spool/certs/chris.cournoyer at digitalinsight.com'
192.168.210.217 192.168.242.126 smelter

<22>Jul 31 21:39:21 192.168.242.126 smelter v0.88.5[23535]: n714dL7N010869: unable to add rcpt 'chris.cournoyer at digitalinsight.com' :: bad certificate
192.168.210.217 192.168.242.126 smelter

==> scribe1a-p <==
<13>Jul 31 21:39:01 scribe1a-p getprocs: 28 /proc/net/tcp=
192.168.210.216 192.168.210.216 scribe1a-p

<13>Jul 31 21:39:01 scribe1a-p getprocs: 138=9 /usr/sbin/apache=9 sleep 30=2 [pdflush]=2 /bin/bash /usr/local/bin/getprocs=1 [xfs_mru_cache]=1 [xfslogd/3]=1 [xfslogd/2]=1 [xfslogd/1]=1 [xfslogd/0]=1 [xfsdatad/3]=
192.168.210.216 192.168.210.216 scribe1a-p

==> scribe1b-b <==
<13>Aug 1 04:39:01 scribe1b-b getprocs: 133=9 sleep 30=3 /usr/sbin/argus -w /var/log/argus/argus.log -n /var/run/argus.pid=3 /bin/bash /usr/local/bin/getprocs=2 [xfssyncd]=2 [xfsbufd]=2 [xfsaild]=2 [pdflush]=1 uniq -c=1 sort -rn=1 ps ax=
192.168.210.220 192.168.210.220 scribe1b-b

<86>Aug 1 04:39:01 scribe1b-b CRON[21219]: pam_unix(cron:session): session closed for user root
192.168.210.220 192.168.210.220 scribe1b-b

==> scribe1b-p <==
<13>Aug 1 00:40:14 MSWinEventLog\0111\011Applicatio Aug 01 00:39:57 2009\0111008\011Perflib\011Unknown User\011N/A\011Error\011BANKINGPDC1\011None\0110000: 68 10 00 00 78 bf 94 01 ...... \011The Open Procedure for service "PerfDisk" in DLL "C:\WINNT\system32\perfdisk.dll" failed. Performance data for this service will not be available. Status code returned is data DWORD 0. 
\01120258586192.168.210.219 192.168.210.219 MSWinEventLog\0111\011Applicatio <29>Jul 31 21:39:21 methane1e-b plug-gw[10538]: disconnect host= /192.168.242.211 destination=179.50.100.127/11282 in=3274 out=1448 duration=0 192.168.210.219 192.168.210.219 methane1e-b ==> scribe1c-b <== ==> scribe1c-p <== <131>Jul 31 21:39:20 10.202.0.252 auditd: date="Aug 1 04:39:20 2009 GMT",fac=f_wwwproxy,area=a_libproxycommon,type=t_nettraffic,pri=p_major,pid=1013,ruid=0,euid=0,pgid=1013,logid=0,cmd=httpp,domain=htpp,edomain=htpp,hostname=warden1-p.diginsight.com,srcip=10.202.0.252,srcport=23865,srcburb=internal,dstip=10.21.48.30,dstport=80,dstburb=internal,protocol=6,bytes_written_to_client=0,bytes_written_to_server=0,service_name=httpp,status=conn_close,acl_id=Warden__Outbound-DEV-NET,cache_hit=1,request_status=0,start_time="Fri Jul 31 21:38:18 2009",netsessid=4a73c6ba0001d7d3 192.168.210.222 10.202.0.252 auditd: <131>Jul 31 21:39:20 10.202.0.252 auditd: date="Aug 1 04:39:20 2009 GMT",fac=f_wwwproxy,area=a_libproxycommon,type=t_nettraffic,pri=p_major,pid=1013,ruid=0,euid=0,pgid=1013,logid=0,cmd=httpp,domain=htpp,edomain=htpp,hostname=warden1-p.diginsight.com,srcip=10.202.0.252,srcport=23865,srcburb=internal,dstip=10.21.48.30,dstport=80,dstburb=internal,protocol=6,bytes_written_to_client=0,bytes_written_to_server=0,service_name=httpp,status=conn_close,acl_id=Warden__Outbound-DEV-NET,cache_hit=1,request_status=0,start_time="Fri Jul 31 21:38:18 2009",netsessid=4a73c6ba0001d7d3 192.168.210.222 10.202.0.252 auditd: ==> scribe1d-p <== <175>Aug 1 00:39:22 172.20.254.6 ^A MSWinEventLog^I1^ISecurity^I343780120^IFri Jul 31 18:20:25 2009^I540^ISecurity^Idataman^IUser^ISuccess Audit^IOPSMON01^ILogon/Logoff^I^Idataman^I343777242 192.168.210.245 172.20.254.6 ^A <175>Aug 1 00:39:22 172.20.254.6 ^A MSWinEventLog^I1^ISecurity^I343780121^IFri Jul 31 18:20:25 2009^I538^ISecurity^Idataman^IUser^ISuccess Audit^IOPSMON01^ILogon/Logoff^I^Idataman^I343777243 192.168.210.245 172.20.254.6 ^A an example of 
the second problem is log entries like this:

<29>Jul 31 21:33:39 methane1d-b plug-gw[13212]: connect host= /192.168.243.38 destination=179.50.100.130/11074
192.168.210.245 192.168.210.245 methane1d-b

the problem is that the log files on the .245 box (which logs *.* to messages) don't show anything like this, and the methane1d-b box doesn't have any networks in common with the .245 box

David Lang

From david at lang.hm Mon Aug 10 23:23:26 2009
From: david at lang.hm (david at lang.hm)
Date: Mon, 10 Aug 2009 14:23:26 -0700 (PDT)
Subject: [rsyslog] 4.2.1 parse request
Message-ID:

I have a device creating logs like this (%rawmsg%)

<6>AUG 10 22:18:24 2009 netips-warden2-p [audit] user=[*SMS] src=192.168.11.11 iface=5 access=9 Update State Reset

rsyslog 4.2.1 makes the hostname 'AUG' and the syslogtag '10'

how bad would it be to check to see if it's a case problem like this when no timestamp is detected?

David Lang

From webmail.hce at gmail.com Tue Aug 11 06:57:12 2009
From: webmail.hce at gmail.com (hce)
Date: Tue, 11 Aug 2009 14:57:12 +1000
Subject: [rsyslog] facility filter
Message-ID: <95455e980908102157x140a8774h7aa24942ad0af96f@mail.gmail.com>

Hi,

I have two C++ programs, one uses LOCAL0 and the other uses LOCAL1. At the moment I am testing on Fedora 9, but it will eventually move to CentOS 5 and log remotely to a log server on Fedora 9. I configured the following in rsyslog.conf on Fedora 9:

*.info;mail.none;authpriv.none;cron.none;local0.none;local1.none /var/log/messages
local0.* /tmp/local0.log
local1.* /tmp/local1.log

Now the program that writes to LOCAL0 works fine; all its messages go to the local0.log file. But the program that writes to LOCAL1 does not work: its messages still go to /var/log/messages, and none appear in local1.log.

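The selectors in the configuration above match on a message's facility and severity. In the syslog protocol both travel as a single PRI number (facility * 8 + severity), which is the value in angle brackets at the start of a raw message. The following is a minimal Python sketch of that arithmetic using the standard syslog codes; the table and function names are illustrative assumptions and are not part of rsyslog or logger.

```python
# Standard syslog facility codes 0-23. The names used here for codes 12-15
# are informal labels (assumption); the rest follow the usual syslog.h names.
FACILITIES = [
    "kern", "user", "mail", "daemon", "auth", "syslog", "lpr", "news",
    "uucp", "cron", "authpriv", "ftp", "ntp", "logaudit", "logalert", "clock",
    "local0", "local1", "local2", "local3",
    "local4", "local5", "local6", "local7",
]
# Standard syslog severity codes 0-7.
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]


def encode_pri(facility, severity):
    """Combine a facility name and a severity name into a PRI number."""
    return FACILITIES.index(facility) * 8 + SEVERITIES.index(severity)


def decode_pri(pri):
    """Split a PRI number back into (facility, severity) names."""
    return FACILITIES[pri >> 3], SEVERITIES[pri & 7]


print(encode_pri("local1", "info"))  # PRI for local1.info
print(decode_pri(1))                 # what a bare numeric priority of 1 means
print(decode_pri(13))                # the <13> seen in many raw messages
```

Under this encoding a bare numeric priority of 1 decodes to kern.alert rather than anything in local0 or local1, so a selector such as local1.* would not match it.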
I then ran the following tests, which confused me even more:

logger -p 1 local1.info "testing", message went to /var/log/messages
logger -p local1.info "testing", message went to local1.log
logger -p 1 local0.info "testing", message went to /var/log/messages
logger -p local0.info "testing", message went to local0.log

It seems messages go to different logs depending on the priority? What am I missing here?

Thank you.

Kind Regards,

Jupiter

From sparf at vingrad.ru Wed Aug 12 15:25:21 2009
From: sparf at vingrad.ru (Stanislav)
Date: Wed, 12 Aug 2009 17:25:21 +0400
Subject: [rsyslog] utf-8 encoded MSG
Message-ID:

Hello,

First of all, sorry for my English. It is very bad.

I am trying to make MS Log Parser (http://www.microsoft.com/technet/scriptcenter/tools/logparser/default.mspx) work correctly with an rsyslog server. And now I have one problem: I can't understand how to create a correct syslog UTF-8 message. I have read in the documentation (doc/syslog-protocol.html) that:

Conclusions/Suggestions

As it is not possible to definitely know the character encoding of the application-provided message, MSG should *not* be specified to use UTF-8 exclusively. Instead, it is suggested that any encoding may be used but UTF-8 is preferred. To detect UTF-8, the MSG should start with the UTF-8 byte order mask of "EF BB BF" if it is UTF-8 encoded (see section 155.9 of http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf)

For example, here we have "EF BB BF" before the UTF-8 encoded string:

0000 3c 33 30 3e 41 70 72 20 31 35 20 31 38 3a 33 36 <30>Apr 15 18:36
0010 3a 35 37 20 50 4b 2d 35 38 30 20 53 65 72 76 69 :57 PK-580 Servi
0020 63 65 43 6f 6e 74 72 6f 6c 4d 61 6e 61 67 65 72 ceControlManager
0030 20 ef bb bf d0 a1 d0 bb d1 83 d0 b6 d0 b1 d0 b0 ...............
0040 20 22 d0 90 d0 b4 d0 b0 d0 bf d1 82 d0 b5 d1 80 "..............
0050 20 d0 bf d1 80 d0 be d0 b8 d0 b7 d0 b2 d0 be d0 ...............
0060 b4 d0 b8 d1 82 d0 b5 d0 bb d1 8c d0 bd d0 be d1 ................
0070 81 d1 82 d0 b8 20 57 4d 49 22 20 d0 bf d0 b5 d1 ..... WMI" .....
0080 80 d0 b5 d1 88 d0 bb d0 b0 20 d0 b2 20 d1 81 d0 ......... .. ...
0090 be d1 81 d1 82 d0 be d1 8f d0 bd d0 b8 d0 b5 20 ...............
00a0 d0 a0 d0 b0 d0 b1 d0 be d1 82 d0 b0 d0 b5 d1 82 ................
00b0 2e .

But in the database I see only broken encoding (with one space in the beginning of the string):

??????????????? "?????????????? ???????????????????????????????????? WMI" ?????????????? ?? ?????????????????? ? ??????????????.

In correct form it looks like:

Служба "Адаптер производительности WMI" перешла в состояние Работает.

P.S. We have rsyslog-3.18.6-4 installed with logging to a MySQL database.

Best regards.
Stanislav.

From lftsy at leurent.eu Wed Aug 12 19:05:13 2009
From: lftsy at leurent.eu (Marc Leurent)
Date: Wed, 12 Aug 2009 19:05:13 +0200
Subject: [rsyslog] Asterisk log into rsyslog mysql backend with double quotes
Message-ID: <200908121905.13176.lftsy@leurent.eu>

Good evening,

I have a simple problem and I would be very grateful if you could help me. I'm trying to put Asterisk logs into a MySQL database using the powerful rsyslog! It's working, except when Asterisk generates a syslog message with double quotes!

My Asterisk server logs to syslog.local0 in rsyslog, and with the config below I put the logs into a MySQL database:

$template bobsql,"INSERT INTO log (server, date, heure, message) VALUES ('%HOSTNAME%', DATE(%timereported:::date-mysql%), TIME(%timereported:::date-mysql%), '%msg%')",SQL
local0.* :ommysql:localhost,asterisk_db,asterisk,bob;bobsql

It adds every line except when %msg% contains a double quote, like this one:

NOTICE[25608] chan_sip.c: Registration from '"1001" ' failed for '88.191.80.8' - Wrong password

Could you explain what I should change to be able to put double quotes in the database? Thanks!

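For context on the quoting question: inside a single-quoted MySQL string literal (in the default SQL mode) a double quote needs no escaping, while single quotes and backslashes do. The Python sketch below only illustrates that rule on the sample message. The helper names, the two-column statement, and the host name "pbx1" are hypothetical, this is not rsyslog's own escaping code, and application code would normally use parameterized queries instead.

```python
def escape_sql_literal(value):
    """Escape backslashes and single quotes for a single-quoted literal.

    Hypothetical helper for illustration only. Backslashes are escaped
    first so the backslashes added for quotes are not doubled again.
    """
    return value.replace("\\", "\\\\").replace("'", "\\'")


def build_insert(server, message):
    """Build a simplified INSERT statement (illustrative two-column table)."""
    return "INSERT INTO log (server, message) VALUES ('%s', '%s')" % (
        escape_sql_literal(server),
        escape_sql_literal(message),
    )


# The sample message from the post, with both kinds of quotes in it.
msg = ("NOTICE[25608] chan_sip.c: Registration from '\"1001\" ' "
       "failed for '88.191.80.8' - Wrong password")
statement = build_insert("pbx1", msg)  # "pbx1" is a made-up host name
print(statement)
```

In the printed statement the double quotes around 1001 pass through unchanged and only the single quotes gain a backslash, which suggests the double quote by itself should not make the statement invalid.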
-- -- -- Marc LEURENT lftsy at leurent.eu

From david at lang.hm Wed Aug 12 21:06:18 2009
From: david at lang.hm (david at lang.hm)
Date: Wed, 12 Aug 2009 12:06:18 -0700 (PDT)
Subject: [rsyslog] rsyslog config feature request
Message-ID:

two things have been giving me grief

1. when doing filtering in rsyslog, it would be _very_ nice if you would accept single quotes everywhere you accept double quotes. currently rsyslog just dies (with no useful error message) if you quote a string like 'this' instead of "this"

2. as I am dealing with garbage logs from different vendors I am running into many special case situations where I can detect the breakage and define how to fix it, but it gets very cumbersome to have a dozen different sets of formats (one for writing locally, one for forwarding), and then for each detected condition list all of the actions (with the particular formats for that condition)

it would be _very_ nice if there was some way to fix the internal variables so that I could have the conditional and then code to fix up the hostname, syslogtag, message, etc to then fall through to one set of formats and actions.

David Lang

From david at lang.hm Wed Aug 12 21:07:26 2009
From: david at lang.hm (david at lang.hm)
Date: Wed, 12 Aug 2009 12:07:26 -0700 (PDT)
Subject: [rsyslog] maintainer on vacation
Message-ID:

I've seen a couple posts to the list asking for help. unfortunately I'm not sure of the right answer for the questions or I would have answered them.

the maintainer is on vacation until around the end of next week.

David Lang

From rgerhards at hq.adiscon.com Wed Aug 12 21:29:58 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Wed, 12 Aug 2009 21:29:58 +0200
Subject: [rsyslog] 4.2.1 parse request
Message-ID: <002a01ca1b83$9f329efa$100013ac@intern.adiscon.com>

David,

I think this is fairly trivial (but I may be wrong), probably best done during the timestamp check. I will look at it when I am back.

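The idea of handling the case problem during the timestamp check can be sketched as follows. This is a minimal Python illustration, not rsyslog's parser and not the eventual patch: it assumes the PRI part has already been stripped, accepts the month name in any letter case, and treats a four-digit year after the time as optional (an assumption based only on the sample line in this thread). The function and constant names are hypothetical.

```python
import re

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Three letters, day of month, HH:MM:SS, then an optional 4-digit year.
TIMESTAMP = re.compile(
    r"(?P<month>[A-Za-z]{3}) +(?P<day>\d{1,2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2})(?: (?P<year>\d{4}))? "
)


def split_timestamp(text):
    """Return (month, day, time, rest) or None if no timestamp is found.

    The month is normalized with capitalize(), so "AUG" and "aug" are
    accepted as well as "Aug".
    """
    m = TIMESTAMP.match(text)
    if m is None:
        return None
    month = m.group("month").capitalize()
    if month not in MONTHS:
        return None
    return month, int(m.group("day")), m.group("time"), text[m.end():]


sample = "AUG 10 22:18:24 2009 netips-warden2-p [audit] user=[*SMS]"
print(split_timestamp(sample))
```

With the timestamp recognized, the remainder of the sample starts at netips-warden2-p, so a parser working this way would no longer mistake AUG for the hostname and 10 for the tag.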
It would probably be a good idea to create an enhancement request in Bugzilla.

rainer

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

From rgerhards at hq.adiscon.com Wed Aug 12 21:49:16 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Wed, 12 Aug 2009 21:49:16 +0200
Subject: [rsyslog] Asterisk log into rsyslog mysql backend with double quotes
Message-ID: <002b01ca1b86$52415dae$100013ac@intern.adiscon.com>

Not sure if I can look at this in the next few days (I have limited, email-only connectivity...), but this looks somewhat strange. Please try to get an error message, maybe by running in debug mode.

From rgerhards at hq.adiscon.com Wed Aug 12 21:49:23 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Wed, 12 Aug 2009 21:49:23 +0200
Subject: [rsyslog] rsyslog config feature request
Message-ID: <002c01ca1b86$54172b6a$100013ac@intern.adiscon.com>

David,

Both are related to the scripting engine, which I unfortunately have not yet had time to work on... Single and double quotes will have different semantics, so I cannot use them interchangeably. Setting parameters/fields is one primary goal of the engine. So it boils down to waiting until it materializes (or at least the parts we need). It doesn't look like this will happen before the late fall/winter timeframe.

rainer

From rgerhards at hq.adiscon.com Wed Aug 12 21:49:25 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Wed, 12 Aug 2009 21:49:25 +0200
Subject: [rsyslog] utf-8 encoded MSG
Message-ID: <002d01ca1b86$55286c38$100013ac@intern.adiscon.com>

The simple answer (unfortunately) is that UTF-8 is not yet supported. There are a number of subtle issues and I have so far been hesitant to address them (to do it right, we would need to work on changes to the logger API...).

From rgerhards at hq.adiscon.com Wed Aug 12 21:50:42 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Wed, 12 Aug 2009 21:50:42 +0200
Subject: [rsyslog] maintainer on vacation
Message-ID: <003501ca1b86$836b5db9$100013ac@intern.adiscon.com>

Thanks, David. I'll be back early next week, but I don't know what pile of work is waiting. So end of next week is a good assumption.

From lftsy at leurent.eu Thu Aug 13 17:27:51 2009
From: lftsy at leurent.eu (Marc Leurent)
Date: Thu, 13 Aug 2009 17:27:51 +0200
Subject: [rsyslog] Asterisk log into rsyslog mysql backend with double quotes
In-Reply-To: <002b01ca1b86$52415dae$100013ac@intern.adiscon.com>
References: <002b01ca1b86$52415dae$100013ac@intern.adiscon.com>
Message-ID: <200908131727.51390.lftsy@leurent.eu>

Hello,

I was using rsyslog 4.2.0-1 on a Debian Squeeze server! I have backported the rsyslog 3.18.6-4 version, and with it logging messages with double quotes into the MySQL database works. So I think this is a bug in version 4.2.0-1. How can I help with filing a bug report?

Thanks

-- -- -- Marc LEURENT lftsy at leurent.eu

From rgerhards at hq.adiscon.com Mon Aug 17 14:47:29 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Mon, 17 Aug 2009 14:47:29 +0200
Subject: [rsyslog] 4.2.1 parse request
References:
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FCF1@GRFEXC.intern.adiscon.com>

Hi David,

I was able to extend the parser to support it. However, I cannot do that in 4.2.1, as this is a stable release and, by definition, stable releases only receive bug fixes (and this is a feature enhancement). So it is part of the next version of v4-devel. However, the patch should be easy to apply to previous versions if you would like to.

Find it here:
http://git.adiscon.com/?p=rsyslog.git;a=commitdiff;h=aa10f7a16415112c014c6c628f2f25f4eb4beaa2

Rainer

From mike at cloudant.com Tue Aug 18 20:45:22 2009
From: mike at cloudant.com (Michael Miller)
Date: Tue, 18 Aug 2009 11:45:22 -0700
Subject: [rsyslog] Creating new output module
Message-ID: <505CB858-3568-4B09-BB64-91B581D0F402@cloudant.com>

Hi,

I'm trying to follow the examples to create a new output module. I'm pretty happy with the documentation of what needs to be done within the C code, but I'm struggling with adding a new output module into the build. I thought that the steps were:

1) modify rsyslog-5.1.3/configure.ac to add the blocks:

# omfoo
AC_ARG_ENABLE(omfoo,
        [AS_HELP_STRING([--enable-omfoo],[Compiles omfoo template module @<:@default=no@:>@])],
        [case "${enableval}" in
         yes) enable_omfoo="yes" ;;
          no) enable_omfoo="no" ;;
           *) AC_MSG_ERROR(bad value ${enableval} for --enable-omfoo) ;;
         esac],
        [enable_omfoo=no]
)
AM_CONDITIONAL(ENABLE_OMTEMPLATE, test x$enable_omfoo = xyes)
# end of copy template

and:

plugins/omfoo/Makefile \

at the bottom.

2) Create a plugins/omfoo/ directory, populate it from omtemplate and suitably change the makefile and source code

3) Generate new configure scripts:

install m4-1.4.13
install autoconf-2.64
install libtool-2.2.6

autoreconf
autoconf
./configure --enable-omfoo
make

However, something seems fishy with the configure scripts that I generated: they end up creating makefiles that contain 'gcc .... -rpath ...', and my version of gcc (gcc (Ubuntu 4.3.2-1ubuntu12) 4.3.2) seems to expect --rpath instead of -rpath. Looks like OS X gets lucky and respects both. But the fact that autotools didn't get this right makes me suspicious that I've screwed up. I also tried adding an automake before autoconf, but got the same error. My build also fails if I remove the --enable-omfoo option to ./configure. I've moved my current version aside and gone back to building/running the unmodified rsyslog-5.1.3 version from vanilla source and it's fine, so it has nothing to do with the install of m4, autoconf, or libtool. Any advice would be greatly appreciated.

-Mike Miller

From rgerhards at hq.adiscon.com Wed Aug 19 08:20:49 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Wed, 19 Aug 2009 08:20:49 +0200
Subject: [rsyslog] Creating new output module
References: <505CB858-3568-4B09-BB64-91B581D0F402@cloudant.com>
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD18@GRFEXC.intern.adiscon.com>

> -----Original Message-----
> From: rsyslog-bounces at lists.adiscon.com
> [mailto:rsyslog-bounces at lists.adiscon.com] On Behalf Of Michael Miller
> Sent: Tuesday, August 18, 2009 8:45 PM
> To: rsyslog at lists.adiscon.com
> Subject: [rsyslog] Creating new output module
>
> Hi,
>
> I'm trying to follow the examples to create a new output
> module. I'm
> pretty happy with the documentation of what needs to be done within
> the c code, but I'm struggling with adding a new output module into
> the build.
I though that that the steps were: > > 1) modify rsyslog-5.1.3/configure.ac to add the blocks: > > # omfoo > AC_ARG_ENABLE(omfoo, > [AS_HELP_STRING([--enable-omfoo],[Compiles omfoo template > module @<:@default=no@:>@])], > [case "${enableval}" in > yes) enable_omfoo="yes" ;; > no) enable_omfoo="no" ;; > *) AC_MSG_ERROR(bad value ${enableval} for > --enable-omfoo) ;; > esac], > [enable_omfoo=no] > ) > AM_CONDITIONAL(ENABLE_OMTEMPLATE, test x$enable_omfoo = xyes) Here you overlooked to replace ENABLE_OMTEMPLATE > # end of copy template > > and: > > plugins/omfoo/Makefile \ > > at the bottom. > > 2) Creat a plugins/omfoo/ directory, populate it from omtemplate and > suitably change the makefile and source code > > 3) Generate new configure scripts: > > install m4-1.4.13 > install autoconf-2.64 > install libtool-2.2.6 > > autoreconf > autoconf > ./configure --enable-omfoo > make > > However, something seems fishy with the configure scripts that I > generated, they end up creating makefiles that contain: 'gcc .... - > rpath ...' and in my version of gcc (gcc (Ubuntu 4.3.2-1ubuntu12) > 4.3.2) seems to expect --rpath instead of -rpath. Looks like > osx gets I don't know about the particular options, but don't think they are affected by the config files (but I am far from being an autotools expert...). > lucky and respects both. But the fact that autotools didn't > get this > right makes me suspicious that I've screwed up. I also tried adding > an automake before autoconf, but same error. My build also > fails if I > remove the --enable-omfoo option to ./configure. I've moved my > current version aside and gone back to building/running the > unmodified > rsyslog-5.1.3 version from vanilla source and it's fine, so it has > nothing to do with the install of m4, autoconf, or libtool. > Any advice > would be greatly appreciated. Maybe the problem is caused by the one replacement you have overlooked. 
If it persists, you can post a tarball and I'll have a look at the complete tree. All in all, what you wrote sounds good to me.

Rainer

From rgerhards at hq.adiscon.com Wed Aug 19 15:11:51 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Wed, 19 Aug 2009 15:11:51 +0200
Subject: [rsyslog] reliability of SSD disks?
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com>

Hi all,

quick question to those in the know: are SSD disks considered reliable from an auditing (or near audit-grade) point of view? Thanks to a hard disk failure, I finally got such a disk in my workstation and the performance improvement is obviously very good and creates quite a different view about the volume that rsyslog can do with "disk" queues.

All thoughts are appreciated (but I have to admit I primarily ask out of curiosity).

Rainer

From epiphani at gmail.com Wed Aug 19 15:19:30 2009
From: epiphani at gmail.com (Aaron Wiebe)
Date: Wed, 19 Aug 2009 09:19:30 -0400
Subject: [rsyslog] reliability of SSD disks?
In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com>
References: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com>
Message-ID:

SSDs are as reliable as, if not more reliable than, your regular spinning rust.

But if you want to get even more speed, and just as much reliability, check out fusionio.com. They're launching a consumer market PCIe card at 80GB this year (at $800 a card). The technology is -very- cool though, I got a presentation about it this week (and met the woz!).

-Aaron

From mrdemeanour at jackpot.uk.net Wed Aug 19 15:52:54 2009
From: mrdemeanour at jackpot.uk.net (Mr. Demeanour)
Date: Wed, 19 Aug 2009 14:52:54 +0100
Subject: [rsyslog] [OT] Re: reliability of SSD disks?
In-Reply-To:
References: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com>
Message-ID: <4A8C03B6.9080300@jackpot.uk.net>

So the website (http://www.fusionio.com) doesn't seem to offer any information on the number of write cycles these devices are able to sustain. Since they are NAND Flash, this would be expected to be up to about a million, rendering them of dubious value as a part of a logging system. With wear-levelling, once cells start dying, I'd expect the death-rate to climb very rapidly indeed.

I would have thought that DRAM-based devices would be more suitable in this role.

--
Jack.

From epiphani at gmail.com Wed Aug 19 16:09:00 2009
From: epiphani at gmail.com (Aaron Wiebe)
Date: Wed, 19 Aug 2009 10:09:00 -0400
Subject: [rsyslog] [OT] Re: reliability of SSD disks?
In-Reply-To: <4A8C03B6.9080300@jackpot.uk.net> References: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com> <4A8C03B6.9080300@jackpot.uk.net> Message-ID: On Wed, Aug 19, 2009 at 9:52 AM, Mr. Demeanour wrote: > > So the website (http://www.fusionio.com) doesn't seem to offer any > information on the number of write cycles these devices are able to > sustain. Since they are NAND Flash, this would be expected to be up to > about a million, rendering them of dubious value as a part of a logging > system. With wear-levelling, once cells start dying, I'd expect the > death-rate to climb very rapidly indeed. This hasn't been true for years. With the improvements in NAND and wear leveling techniques, the performance and lifetime of NAND based storage outweighs that of spinning rust. On fusion-io specifically, they also use a byte-level RAID type algorithm to handle failures. In their talk earlier this week, they told us that they have -never- had a card come back failed. But many people still software-raid1 two cards together. DRAM tends to be more complicated in power handling, and getting 640GB of it can be very expensive indeed. Granted, the applications are similar, but I would expect PCIe based devices to take bigger hold over the next few years. -Aaron From david at lang.hm Wed Aug 19 17:50:20 2009 From: david at lang.hm (david at lang.hm) Date: Wed, 19 Aug 2009 08:50:20 -0700 (PDT) Subject: [rsyslog] reliability of SSD disks? In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com> References: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com> Message-ID: On Wed, 19 Aug 2009, Rainer Gerhards wrote: > Hi all, > > quick question to those in the know: are SSD disks considered reliable from > an auditing (or near audit-grade) point of few? 
Thank to a hard disk failure, > I finally got such a disk in my workstation and the performance improvement > is obviously very good and creates quite a different view about the volume > that rsyslog can do with "disk" queues. they are far more reliable than normal drives, but you would still want to have a mirrored pair for true audit-grade purposes. they do wear out over time (although that time is expected to be several years worth of continuous write activity) that being said, for my normal systems I am now buying a single SSD where before I purchased a mirrored pair of high-speed SCSI/SAS drives David Lang > All thoughts are appreciated (but I have to admit I primarily ask out of > curiosity). > > Rainer > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com > From jmoyer at redhat.com Wed Aug 19 18:21:37 2009 From: jmoyer at redhat.com (Jeff Moyer) Date: Wed, 19 Aug 2009 12:21:37 -0400 Subject: [rsyslog] reliability of SSD disks? In-Reply-To: (david@lang.hm's message of "Wed, 19 Aug 2009 08:50:20 -0700 (PDT)") References: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com> Message-ID: david at lang.hm writes: > On Wed, 19 Aug 2009, Rainer Gerhards wrote: > >> Hi all, >> >> quick question to those in the know: are SSD disks considered reliable from >> an auditing (or near audit-grade) point of few? Thank to a hard disk failure, >> I finally got such a disk in my workstation and the performance improvement >> is obviously very good and creates quite a different view about the volume >> that rsyslog can do with "disk" queues. > > they are far more reliable than normal drives, but you would still want to > have a mirrored pair for true audit-grade purposes. 
they do wear out over > time (although that time is expected to be several years worth of > continuous write activity) > > that being said, for my normal systems I am now buying a single SSD where > before I purchased a mirrored pair of high-speed SCSI/SAS drives I find these claims of reliability surprising, if only due to the lack of soak time for such drives. There is also no mention of the class of device. Are we talking about consumer grade MLC? SLC? Are some vendors' devices better than others? Not all SSDs are created equal. Cheers, Jeff From rio at rio.st Wed Aug 19 20:07:49 2009 From: rio at rio.st (=?ISO-2022-JP?B?GyRCRiNFRBsoQiAbJEJORxsoQg==?=) Date: Thu, 20 Aug 2009 03:07:49 +0900 Subject: [rsyslog] reliability of SSD disks? In-Reply-To: References: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com> Message-ID: <922E84E0-C0ED-4CE9-8D15-86881F60EA03@rio.st> Hi all, I've used Intel's one (SSDSA2MH080G1, MLC cell) as follows. - for RHEL5 server for 9 months long, not RAIDed - for Mac OS X for 3 months long, not RAIDed http://www.intel.com/design/flash/nand/mainstream/index.htm I've not met any disk I/O troubles. But some of my colleagues have met troubles with cheaper SSDs. As Jeff-san pointed out, NOT ALL SSDs are created equal : ( On 2009/08/20, at 1:21, Jeff Moyer wrote: > david at lang.hm writes: > >> On Wed, 19 Aug 2009, Rainer Gerhards wrote: >> >>> Hi all, >>> >>> quick question to those in the know: are SSD disks considered >>> reliable from >>> an auditing (or near audit-grade) point of few? Thank to a hard >>> disk failure, >>> I finally got such a disk in my workstation and the performance >>> improvement >>> is obviously very good and creates quite a different view about >>> the volume >>> that rsyslog can do with "disk" queues. >> >> they are far more reliable than normal drives, but you would still >> want to >> have a mirrored pair for true audit-grade purposes. 
they do wear >> out over >> time (although that time is expected to be several years worth of >> continuous write activity) >> >> that being said, for my normal systems I am now buying a single SSD >> where >> before I purchased a mirrored pair of high-speed SCSI/SAS drives > > I find these claims of reliability surprising, if only due to the lack > of soak time for such drives. There is also no mention of the class > of > device. Are we talking about consumer grade MLC? SLC? Are some > vendors' devices better than others? Not all SSDs are created equal. > > Cheers, > Jeff > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com From david at lang.hm Thu Aug 20 05:04:51 2009 From: david at lang.hm (david at lang.hm) Date: Wed, 19 Aug 2009 20:04:51 -0700 (PDT) Subject: [rsyslog] reliability of SSD disks? In-Reply-To: References: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com> Message-ID: On Wed, 19 Aug 2009, Jeff Moyer wrote: > david at lang.hm writes: > >> On Wed, 19 Aug 2009, Rainer Gerhards wrote: >> >>> Hi all, >>> >>> quick question to those in the know: are SSD disks considered reliable from >>> an auditing (or near audit-grade) point of few? Thank to a hard disk failure, >>> I finally got such a disk in my workstation and the performance improvement >>> is obviously very good and creates quite a different view about the volume >>> that rsyslog can do with "disk" queues. >> >> they are far more reliable than normal drives, but you would still want to >> have a mirrored pair for true audit-grade purposes. 
>> they do wear out over time (although that time is expected to be several
>> years worth of continuous write activity)
>>
>> that being said, for my normal systems I am now buying a single SSD where
>> before I purchased a mirrored pair of high-speed SCSI/SAS drives
>
> I find these claims of reliability surprising, if only due to the lack
> of soak time for such drives. There is also no mention of the class of
> device. Are we talking about consumer grade MLC? SLC? Are some
> vendors' devices better than others? Not all SSDs are created equal.

the question of how different vendors and different drive models compare to each other is something that I can't speak to. My feeling is that there isn't enough history to make any judgements.

however, as a class, comparing SSDs to standard rotating media drives, I see the elimination of the mechanical portions as being extremely significant. In any computer system, the mechanical parts tend to fail _far_ sooner than anything else, and with no warning. high performance rotating hard drives also generate a _lot_ of heat, which hurts the life of the entire system.

the failure mode of flash is such that, in general, it will fail when you write to it, not when you read from it.

we are definitely in the early stages of SSDs being deployed, and I may find in a couple of years that the drives start to fail on me. but based on the knowledge available now, that seems like a reasonable risk, and the benefit in the meantime (much faster performance at a similar or lower price) makes it a reasonable tradeoff for my normal systems.

for a system with critical data on it, I would still use a raid card (with battery backed cache) and redundant SSDs.

as always when dealing with bleeding edge technology, you need to make your own risk analysis.

David Lang

From david at lang.hm Thu Aug 20 05:14:21 2009
From: david at lang.hm (david at lang.hm)
Date: Wed, 19 Aug 2009 20:14:21 -0700 (PDT)
Subject: [rsyslog] reliability of SSD disks?
In-Reply-To:
References: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com>
Message-ID:

On Wed, 19 Aug 2009, Aaron Wiebe wrote:

> SSD's are as reliable, if not more reliable, than your regular spinning rust.
>
> But if you want to get even more speed, and just as much reliability,
> check out fusionio.com. They're launching a consumer market PCIe card
> at 80GB this year (at $800 a card). The technology is -very- cool
> though, I got a presentation about it this week (and met the woz!).

I have one of their cards; at the moment they are no faster than a normal SSD for rsyslog (in large part due to bottlenecks in rsyslog).

in addition, their normal cards are _far_ more expensive for their size than normal SSDs. if they have a new card/price structure I'm glad to hear about it. I was writing them off due to their cost.

David Lang

From david at lang.hm Thu Aug 20 05:35:28 2009
From: david at lang.hm (david at lang.hm)
Date: Wed, 19 Aug 2009 20:35:28 -0700 (PDT)
Subject: [rsyslog] [OT] Re: reliability of SSD disks?
In-Reply-To: <4A8C03B6.9080300@jackpot.uk.net>
References: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com> <4A8C03B6.9080300@jackpot.uk.net>
Message-ID:

On Wed, 19 Aug 2009, Mr. Demeanour wrote:

> Aaron Wiebe wrote:
>> SSD's are as reliable, if not more reliable, than your regular
>> spinning rust.
>>
>> But if you want to get even more speed, and just as much reliability,
>> check out fusionio.com. They're launching a consumer market PCIe
>> card at 80GB this year (at $800 a card). The technology is -very-
>> cool though, I got a presentation about it this week (and met the
>> woz!).
>
> So the website (http://www.fusionio.com) doesn't seem to offer any
> information on the number of write cycles these devices are able to
> sustain. Since they are NAND Flash, this would be expected to be up to
> about a million, rendering them of dubious value as a part of a logging
> system. With wear-levelling, once cells start dying, I'd expect the
> death-rate to climb very rapidly indeed.
>
> I would have thought that DRAM-based devices would be more suitable in
> this role.

remember that with wear leveling, that is a million writes to each spot on the drive.

even with the 'write magnification' effect (where every write that actually goes to disk requires rewriting an entire eraseblock), an 80G drive with 128K blocks can absorb about 640,000,000,000 writes. at the theoretical max of 100,000 writes/sec it would only last 6,400,000 seconds, or 74 days. if only writing at 10,000 logs/sec that becomes 740 days, or about two years.

if you go with a 160G drive of the same specs the wearout numbers double.
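
The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes ideal wear leveling, decimal units (80G = 80 * 10**9 bytes), and that every log write costs one full eraseblock write, which is the worst case described in the message. The function names are made up for illustration.

```python
# Back-of-the-envelope flash wear-out estimate, following the arithmetic above.
# All inputs are the round figures from the message, not measured values.

SECONDS_PER_DAY = 86_400


def total_block_writes(drive_bytes, eraseblock_bytes, cycles_per_block):
    """Eraseblock writes the drive can absorb, assuming ideal wear leveling."""
    blocks = drive_bytes // eraseblock_bytes
    return blocks * cycles_per_block


def lifetime_days(total_writes, writes_per_second):
    """Days until the write budget is used up at a constant write rate."""
    return total_writes / writes_per_second / SECONDS_PER_DAY


# 80G drive, 128K eraseblocks, about 1,000,000 cycles per block
budget = total_block_writes(80 * 10**9, 128 * 10**3, 1_000_000)

for rate in (100_000, 10_000):
    print(f"{rate:>7} writes/sec -> {lifetime_days(budget, rate):.0f} days")
```

With exact decimal units the budget comes out at 625,000,000,000 writes, so the script reports roughly 72 and 723 days; the message rounds the budget to 640,000,000,000 and therefore arrives at 74 and 740 days. Doubling `drive_bytes` doubles both results.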
another option is to go with a raid card with battery backed cache. if that cache can delay the writes to the flash drive itself, so that instead of an fsync for every message causing a write for every 256 characters it does one for every 256K, the wearout time for the flash drives skyrockets (74,000 days, or 200 years)

Intel claims that their drives only have a write magnification factor of ~1.2:1 or so. to do this they would have to do some caching on the drive. if they really do achieve this sort of result, the lifetime of their drives would be measured in years or decades of service at max write rates

yes, for extremely high traffic volumes it may be better to go with something like the ANS-9010 Serial ATA RAM disk http://techreport.com/articles.x/16255/1 which is even faster than the flash SSDs, but it's significantly more expensive and it requires a 5.25-inch drive bay (hard to get in rackmount equipment nowadays)

David Lang

From rgerhards at hq.adiscon.com Thu Aug 20 08:51:59 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Thu, 20 Aug 2009 08:51:59 +0200
Subject: [rsyslog] reliability of SSD disks?
References: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com>
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD2F@GRFEXC.intern.adiscon.com>

> -----Original Message-----
> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
> bounces at lists.adiscon.com] On Behalf Of david at lang.hm
> Sent: Thursday, 20 August 2009 05:14
> To: rsyslog-users
> Subject: Re: [rsyslog] reliability of SSD disks?
>
> On Wed, 19 Aug 2009, Aaron Wiebe wrote:
>
> > SSD's are as reliable, if not more reliable, than your regular spinning rust.
> >
> > But if you want to get even more speed, and just as much reliability,
> > check out fusionio.com. They're launching a consumer market PCIe card
> > at 80GB this year (at $800 a card). The technology is -very- cool
> > though, I got a presentation about it this week (and met the woz!).
> > I have one of their cards, at the moment they are no faster than a > normal > SSD for rsyslog (in large part due to bottlenecks in rsyslog). The bottleneck is due to the frequent file open/closes in ultra-reliable mode, I assume? Or anything else? Rainer PS: BTW: thanks everyone for all the good info. It is very interesting to me. I actually was concerned about the write cycles, but as it looks this seems to be far less of a concern than I initially thought. From tbergfeld at hq.adiscon.com Thu Aug 20 10:41:07 2009 From: tbergfeld at hq.adiscon.com (Tom Bergfeld) Date: Thu, 20 Aug 2009 10:41:07 +0200 Subject: [rsyslog] rsyslog 5.1.4 (devel) released Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD37@GRFEXC.intern.adiscon.com> Hi all, We have just released rsyslog 5.1.4, a member of the v5-development branch. This is a refresh of the v5-devel version. It includes a couple of important bug fixes and some minor features. See Changelog for more details. This is a recommended update for all users of the devel branch. Download: http://www.rsyslog.com/Downloads-req-viewdownloaddetails-lid-170.phtml Changelog: http://www.rsyslog.com/Article392.phtml As always, feedback is appreciated. Best regards, Tom Bergfeld -- Support ======= Improving rsyslog is costly, but you can help! We are looking for organizations that find rsyslog useful and wish to contribute back. You can contribute by reporting bugs, improve the software, or donate money or equipment. Commercial support contracts for rsyslog are available, and they help finance continued maintenance. Adiscon GmbH, a privately held German company, is currently funding rsyslog development. We are always looking for interesting development projects. For details on how to help, please see http://www.rsyslog.com/doc-how2help.html . From david at lang.hm Thu Aug 20 10:44:37 2009 From: david at lang.hm (david at lang.hm) Date: Thu, 20 Aug 2009 01:44:37 -0700 (PDT) Subject: [rsyslog] reliability of SSD disks? 
In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD2F@GRFEXC.intern.adiscon.com> References: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD2F@GRFEXC.intern.adiscon.com> Message-ID: On Thu, 20 Aug 2009, Rainer Gerhards wrote: >> -----Original Message----- >> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- >> bounces at lists.adiscon.com] On Behalf Of david at lang.hm >> >> On Wed, 19 Aug 2009, Aaron Wiebe wrote: >> >>> SSD's are as reliable, if not more reliable, than your regular >> spinning rust. >>> >>> But if you want to get even more speed, and just as much reliability, >>> check out fusionio.com. They're launching a consumer market PCIe >> card >>> at 80GB this year (at $800 a card). The technology is -very- cool >>> though, I got a presentation about it this week (and met the woz!). >> >> I have one of their cards, at the moment they are no faster than a >> normal >> SSD for rsyslog (in large part due to bottlenecks in rsyslog). > > The bottleneck is due to the frequent file open/closes in ultra-reliable > mode, I assume? Or anything else? that was the obvious one that stood out for me. I suspect that there are others, but we haven't investigated much yet. David Lang From rgerhards at hq.adiscon.com Thu Aug 20 10:51:16 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Thu, 20 Aug 2009 10:51:16 +0200 Subject: [rsyslog] Ultra-reliable mode performance - was: RE: reliability of SSD disks? References: <9B6E2A8877C38245BFB15CC491A11DA706FD2A@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD2F@GRFEXC.intern.adiscon.com> Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD39@GRFEXC.intern.adiscon.com> > > The bottleneck is due to the frequent file open/closes in ultra- > reliable > > mode, I assume? Or anything else? > > that was the obvious one that stood out for me. I suspect that there > are > others, but we haven't investigated much yet. Thanks, David. 
During the next weeks I'll check whether I can reduce this cost somewhat without changing too much in that driver. Still, the ultimate solution is to write an audit-grade store driver, as we already discussed. I am glad I was able to make a number of simplifications and hope to be able to make a couple more. With each simplification, new store drivers become easier to implement. This is just for info.

Rainer

From david at lang.hm Thu Aug 20 10:50:16 2009
From: david at lang.hm (david at lang.hm)
Date: Thu, 20 Aug 2009 01:50:16 -0700 (PDT)
Subject: [rsyslog] rsyslog 5.1.4 (devel) released
In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD37@GRFEXC.intern.adiscon.com>
References: <9B6E2A8877C38245BFB15CC491A11DA706FD37@GRFEXC.intern.adiscon.com>
Message-ID:

thanks for this update. unfortunately I will not be able to do much serious testing of the 5.1 branch until the bug that was causing incoming UDP messages to get the incorrect source is fixed. I don't see any mention of that in this changelog.

David Lang

From rgerhards at hq.adiscon.com Thu Aug 20 10:52:45 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Thu, 20 Aug 2009 10:52:45 +0200
Subject: [rsyslog] rsyslog 5.1.4 (devel) released
References: <9B6E2A8877C38245BFB15CC491A11DA706FD37@GRFEXC.intern.adiscon.com>
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD3A@GRFEXC.intern.adiscon.com>

> -----Original Message-----
> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
> bounces at lists.adiscon.com] On Behalf Of david at lang.hm
> Sent: Thursday, August 20, 2009 10:50 AM
> To: rsyslog-users
> Subject: Re: [rsyslog] rsyslog 5.1.4 (devel) released
>
> thanks for this update. unfortunately I will not be able to do much
> serious testing of the 5.1 branch until the bug that was causing
> incoming UDP messages to get the incorrect source is fixed. I don't
> see any mention of that in this changelog.

Maybe I lost this during my vacation, but I have to admit I do not currently know which bug you are talking about. I would appreciate it if you could re-post anything that I may have missed.
Rainer

From david at lang.hm Thu Aug 20 11:04:48 2009
From: david at lang.hm (david at lang.hm)
Date: Thu, 20 Aug 2009 02:04:48 -0700 (PDT)
Subject: [rsyslog] more 5.1.3 errors (fwd)
Message-ID:

re-sending

---------- Forwarded message ----------
Date: Fri, 31 Jul 2009 21:53:57 -0700 (PDT)
From: david at lang.hm
To: rsyslog-users
Subject: more 5.1.3 errors

I have the following in the config file

$template raw,"%rawmsg%\n%fromhost% %hostname% %syslogtag%\n\n\n"
if $fromhost == '192.168.210.216' then /var/log/scribe1a-p;raw
if $fromhost == '192.168.210.217' then /var/log/scribe1a-b;raw
if $fromhost == '192.168.210.219' then /var/log/scribe1b-p;raw
if $fromhost == '192.168.210.220' then /var/log/scribe1b-b;raw
if $fromhost == '192.168.210.222' then /var/log/scribe1c-p;raw
if $fromhost == '192.168.210.223' then /var/log/scribe1c-b;raw
if $fromhost == '192.168.210.245' then /var/log/scribe1d-p;raw

but if I do a tail of these files I get very weird results

I have some logs in the wrong files, and I have some of them where the fromhost is in the hostname (and the hostname is in the syslogtag)

the second error seems fairly consistent with a given source, unfortunately the worst offender is another rsyslog 5.1.3 box.
this first example shows the scribe1b boxes with the incorrect hostname and syslog tag (scribe1b is the other rsyslog box, the one showing the problem)

# tail scribe1*
==> scribe1a-b <==
<22>Jul 31 21:39:21 192.168.242.126 smelter v0.88.5[23535]: n714dL7N010869: unable to open S/MIME certificate '/var/spool/certs/chris.cournoyer at digitalinsight.com'

192.168.210.217 192.168.242.126 smelter


<22>Jul 31 21:39:21 192.168.242.126 smelter v0.88.5[23535]: n714dL7N010869: unable to add rcpt 'chris.cournoyer at digitalinsight.com' :: bad certificate

192.168.210.217 192.168.242.126 smelter



==> scribe1a-p <==
<13>Jul 31 21:39:01 scribe1a-p getprocs: 28 /proc/net/tcp=

192.168.210.216 192.168.210.216 scribe1a-p


<13>Jul 31 21:39:01 scribe1a-p getprocs: 138=9 /usr/sbin/apache=9 sleep 30=2 [pdflush]=2 /bin/bash /usr/local/bin/getprocs=1 [xfs_mru_cache]=1 [xfslogd/3]=1 [xfslogd/2]=1 [xfslogd/1]=1 [xfslogd/0]=1 [xfsdatad/3]=

192.168.210.216 192.168.210.216 scribe1a-p



==> scribe1b-b <==
<13>Aug 1 04:39:01 scribe1b-b getprocs: 133=9 sleep 30=3 /usr/sbin/argus -w /var/log/argus/argus.log -n /var/run/argus.pid=3 /bin/bash /usr/local/bin/getprocs=2 [xfssyncd]=2 [xfsbufd]=2 [xfsaild]=2 [pdflush]=1 uniq -c=1 sort -rn=1 ps ax=

192.168.210.220 192.168.210.220 scribe1b-b


<86>Aug 1 04:39:01 scribe1b-b CRON[21219]: pam_unix(cron:session): session closed for user root

192.168.210.220 192.168.210.220 scribe1b-b



==> scribe1b-p <==

<13>Aug 1 00:40:14 MSWinEventLog\0111\011Applicatio Aug 01 00:39:57 2009\0111008\011Perflib\011Unknown User\011N/A\011Error\011BANKINGPDC1\011None\0110000: 68 10 00 00 78 bf 94 01 ...... \011The Open Procedure for service "PerfDisk" in DLL "C:\WINNT\system32\perfdisk.dll" failed. Performance data for this service will not be available. Status code returned is data DWORD 0. \01120258586192.168.210.219 192.168.210.219 MSWinEventLog\0111\011Applicatio


<29>Jul 31 21:39:21 methane1e-b plug-gw[10538]: disconnect host= /192.168.242.211 destination=179.50.100.127/11282 in=3274 out=1448 duration=0

192.168.210.219 192.168.210.219 methane1e-b



==> scribe1c-b <==

==> scribe1c-p <==
<131>Jul 31 21:39:20 10.202.0.252 auditd: date="Aug 1 04:39:20 2009 GMT",fac=f_wwwproxy,area=a_libproxycommon,type=t_nettraffic,pri=p_major,pid=1013,ruid=0,euid=0,pgid=1013,logid=0,cmd=httpp,domain=htpp,edomain=htpp,hostname=warden1-p.diginsight.com,srcip=10.202.0.252,srcport=23865,srcburb=internal,dstip=10.21.48.30,dstport=80,dstburb=internal,protocol=6,bytes_written_to_client=0,bytes_written_to_server=0,service_name=httpp,status=conn_close,acl_id=Warden__Outbound-DEV-NET,cache_hit=1,request_status=0,start_time="Fri Jul 31 21:38:18 2009",netsessid=4a73c6ba0001d7d3

192.168.210.222 10.202.0.252 auditd:


<131>Jul 31 21:39:20 10.202.0.252 auditd: date="Aug 1 04:39:20 2009 GMT",fac=f_wwwproxy,area=a_libproxycommon,type=t_nettraffic,pri=p_major,pid=1013,ruid=0,euid=0,pgid=1013,logid=0,cmd=httpp,domain=htpp,edomain=htpp,hostname=warden1-p.diginsight.com,srcip=10.202.0.252,srcport=23865,srcburb=internal,dstip=10.21.48.30,dstport=80,dstburb=internal,protocol=6,bytes_written_to_client=0,bytes_written_to_server=0,service_name=httpp,status=conn_close,acl_id=Warden__Outbound-DEV-NET,cache_hit=1,request_status=0,start_time="Fri Jul 31 21:38:18 2009",netsessid=4a73c6ba0001d7d3

192.168.210.222 10.202.0.252 auditd:



==> scribe1d-p <==
<175>Aug 1 00:39:22 172.20.254.6 ^A MSWinEventLog^I1^ISecurity^I343780120^IFri Jul 31 18:20:25 2009^I540^ISecurity^Idataman^IUser^ISuccess Audit^IOPSMON01^ILogon/Logoff^I^Idataman^I343777242

192.168.210.245 172.20.254.6 ^A


<175>Aug 1 00:39:22 172.20.254.6 ^A MSWinEventLog^I1^ISecurity^I343780121^IFri Jul 31 18:20:25 2009^I538^ISecurity^Idataman^IUser^ISuccess Audit^IOPSMON01^ILogon/Logoff^I^Idataman^I343777243

192.168.210.245 172.20.254.6 ^A



an example of
the second problem is log entries like this

<29>Jul 31 21:33:39 methane1d-b plug-gw[13212]: connect host= /192.168.243.38 destination=179.50.100.130/11074

192.168.210.245 192.168.210.245 methane1d-b


the problem is that the log file on the .245 box (which logs *.* to messages) doesn't show anything like this, and the methane1d-b box doesn't have any networks in common with the .245 box

David Lang

From rgerhards at hq.adiscon.com Thu Aug 20 11:06:42 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Thu, 20 Aug 2009 11:06:42 +0200
Subject: [rsyslog] more 5.1.3 errors (fwd)
References:
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD3D@GRFEXC.intern.adiscon.com>

David,

I have not analysed (not even read) the mail in detail, but shouldn't you query fromhost-ip instead of fromhost?

Rainer

From rgerhards at hq.adiscon.com Thu Aug 20 11:14:09 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Thu, 20 Aug 2009 11:14:09 +0200
Subject: [rsyslog] more 5.1.3 errors (fwd)
References:
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD3E@GRFEXC.intern.adiscon.com>

> -----Original Message-----
> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
> bounces at lists.adiscon.com] On Behalf Of david at lang.hm
> Sent: Thursday, August 20, 2009 11:05 AM
> To: rsyslog-users
> Subject: [rsyslog] more 5.1.3 errors (fwd)
>
> re-sending
>
> ---------- Forwarded message ----------
> Date: Fri, 31 Jul 2009 21:53:57 -0700 (PDT)
> From: david at lang.hm
> To: rsyslog-users
> Subject: more 5.1.3 errors
>
> I have the following in the config file
>
> $template raw,"%rawmsg%\n%fromhost% %hostname% %syslogtag%\n\n\n"
> if $fromhost == '192.168.210.216' then /var/log/scribe1a-p;raw
> if $fromhost == '192.168.210.217' then /var/log/scribe1a-b;raw
> if $fromhost == '192.168.210.219' then /var/log/scribe1b-p;raw
> if $fromhost == '192.168.210.220' then /var/log/scribe1b-b;raw
> if $fromhost == '192.168.210.222' then /var/log/scribe1c-p;raw
> if $fromhost
== '192.168.210.223' then /var/log/scribe1c-b;raw > if $fromhost == '192.168.210.245' then /var/log/scribe1d-p;raw > > > but if I do a tail of these files I get very wierd results > > I have some logs in the wrong files, and I have some of them where the > fromhost > in in the hostname (and the hostname is in the syslogtag) > > the second error seems fairly consistant with a given source, > unfortunantly the > worst offender is another rsyslog 5.1.3 box. > > this first example shows the sceibe1b boxes with the incorrect hostname > and > system tag (scribe1b is the other rsyslog box, the one showing the > problem) > > # tail scribe1* > ==> scribe1a-b <== > <22>Jul 31 21:39:21 192.168.242.126 smelter v0.88.5[23535]: n714dL7N010869: unable to open S/MIME certificate This looks strange. Is this from rsyslog with default templates? If so, is the sender actually relaying data from some other source? I am asking because the only reason I can think of why there is an IP address in front of the hostname is that an original sender is issuing a malformed message, this is received and re-interpreted by rsyslogd, which then sends out a message in "invalid" format because the parser populated the wrong fields (and thus resulting in what you see on the ultimate end system). 
I have to admit I am heavily puzzled ;) Rainer From david at lang.hm Thu Aug 20 11:18:00 2009 From: david at lang.hm (david at lang.hm) Date: Thu, 20 Aug 2009 02:18:00 -0700 (PDT) Subject: [rsyslog] more 5.1.3 errors (fwd) In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD3E@GRFEXC.intern.adiscon.com> References: <9B6E2A8877C38245BFB15CC491A11DA706FD3E@GRFEXC.intern.adiscon.com> Message-ID: On Thu, 20 Aug 2009, Rainer Gerhards wrote: >> -----Original Message----- >> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- >> bounces at lists.adiscon.com] On Behalf Of david at lang.hm >> re-sending >> >> ---------- Forwarded message ---------- >> Date: Fri, 31 Jul 2009 21:53:57 -0700 (PDT) >> From: david at lang.hm >> To: rsyslog-users >> Subject: more 5.1.3 errors >> >> I have the following in the config file >> >> $template raw,"%rawmsg%\n%fromhost% %hostname% %syslogtag%\n\n\n" >> if $fromhost == '192.168.210.216' then /var/log/scribe1a-p;raw >> if $fromhost == '192.168.210.217' then /var/log/scribe1a-b;raw >> if $fromhost == '192.168.210.219' then /var/log/scribe1b-p;raw >> if $fromhost == '192.168.210.220' then /var/log/scribe1b-b;raw >> if $fromhost == '192.168.210.222' then /var/log/scribe1c-p;raw >> if $fromhost == '192.168.210.223' then /var/log/scribe1c-b;raw >> if $fromhost == '192.168.210.245' then /var/log/scribe1d-p;raw >> >> >> but if I do a tail of these files I get very wierd results >> >> I have some logs in the wrong files, and I have some of them where the >> fromhost >> in in the hostname (and the hostname is in the syslogtag) >> >> the second error seems fairly consistant with a given source, >> unfortunantly the >> worst offender is another rsyslog 5.1.3 box. 
>> >> this first example shows the sceibe1b boxes with the incorrect hostname >> and >> system tag (scribe1b is the other rsyslog box, the one showing the >> problem) >> >> # tail scribe1* >> ==> scribe1a-b <== >> <22>Jul 31 21:39:21 192.168.242.126 smelter v0.88.5[23535]: n714dL7N010869: > unable to open S/MIME certificate > > This looks strange. Is this from rsyslog with default templates? If so, is > the sender actually relaying data from some other source? I am asking because > the only reason I can think of why there is an IP address in front of the > hostname is that an original sender is issuing a malformed message, this is > received and re-interpreted by rsyslogd, which then sends out a message in > "invalid" format because the parser populated the wrong fields (and thus > resulting in what you see on the ultimate end system). I have to admit I am > heavily puzzled ;) smelter is the syslogtag, not the machine name. David Lang From rgerhards at hq.adiscon.com Thu Aug 20 11:25:39 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Thu, 20 Aug 2009 11:25:39 +0200 Subject: [rsyslog] more 5.1.3 errors (fwd) References: Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com> > > this first example shows the sceibe1b boxes with the incorrect hostname > and > system tag (scribe1b is the other rsyslog box, the one showing the > problem) > > # tail scribe1* > ==> scribe1a-b <== > <22>Jul 31 21:39:21 192.168.242.126 smelter v0.88.5[23535]: > n714dL7N010869: > unable to open S/MIME certificate > '/var/spool/certs/chris.cournoyer at digitalinsight.com' > > 192.168.210.217 192.168.242.126 smelter > > > <22>Jul 31 21:39:21 192.168.242.126 smelter v0.88.5[23535]: > n714dL7N010869: > unable to add rcpt 'chris.cournoyer at digitalinsight.com' :: bad > certificate > > 192.168.210.217 192.168.242.126 smelter mmhhh... if "smelter" is the tag, the issue is that fromhost is ending in 217 but the hostname reported is 126? 
If so, may this box be multihomed? rsyslog (should ;)) use simply what is provided to it, and it looks like that was .126 (to be shown ;)). But first things first: is my understanding of the failure scenario correct? Rainer From rgerhards at hq.adiscon.com Thu Aug 20 11:35:58 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Thu, 20 Aug 2009 11:35:58 +0200 Subject: [rsyslog] more 5.1.3 errors (fwd) References: Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD42@GRFEXC.intern.adiscon.com> David, some more comments... > <175>Aug 1 00:39:22 172.20.254.6 ^A > MSWinEventLog^I1^ISecurity^I343780121^IFri > Jul 31 18:20:25 2009^I538^ISecurity^Idataman^IUser^ISuccess > Audit^IOPSMON01^ILogon/Logoff^I^Idataman^I343777243 > > 192.168.210.245 172.20.254.6 ^A provided that 192.168.210.245 is the correct sender address as seen by the receiver (NAT?), this message looks good (^A actually is the tag, even though the sender has probably not intended it to be a tag...) > > > > an example of the second problem is log entries like this > > <29>Jul 31 21:33:39 methane1d-b plug-gw[13212]: connect host= > /192.168.243.38 > destination=179.50.100.130/11074 > > 192.168.210.245 192.168.210.245 methane1d-b > > > the problem is that the log file on the .245 box (which log *.* to > messages) > don't show anything like this, and the methane1d-b box doesn't have any > networks in common with the .245 box I don't get any grip on this. Would it be possible to provide (privately) debug log files for this processing? I have really a hard time figuring out what's going on there, and I am not sure if some unprintable characters are part of the picture. Only the debug log will show me that... 
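A debug log is the authoritative way to see what arrived, but a captured raw message can also be checked for unprintable characters by rendering control bytes visibly. The following is only a minimal sketch in Python and not part of rsyslog; the function name `show_controls` and the sample bytes are made up for illustration. It uses the same caret notation (`^A`, `^I`) that shows up in the tail output quoted in this thread.

```python
def show_controls(raw: bytes) -> str:
    """Render control bytes in caret notation so they become visible."""
    out = []
    for byte in raw:
        if byte < 0x20:
            # 0x01 -> ^A, 0x09 (tab) -> ^I, and so on
            out.append("^" + chr(byte + 0x40))
        elif byte == 0x7F:
            out.append("^?")
        elif byte < 0x7F:
            out.append(chr(byte))
        else:
            # Non-ASCII bytes are shown as hex escapes
            out.append("\\x%02x" % byte)
    return "".join(out)


# Hypothetical sample: a tag that is really a control character,
# followed by a tab-separated field.
sample = b"<13>Aug  1 00:40:14 somehost \x01\tfield"
print(show_controls(sample))
```

Running a suspect line through something like this before sending it to the list makes it easier to tell whether a stray byte sits between the hostname and the tag.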
Rainer From david at lang.hm Thu Aug 20 11:48:13 2009 From: david at lang.hm (david at lang.hm) Date: Thu, 20 Aug 2009 02:48:13 -0700 (PDT) Subject: [rsyslog] more 5.1.3 errors (fwd) In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com> References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com> Message-ID: On Thu, 20 Aug 2009, Rainer Gerhards wrote: >> this first example shows the sceibe1b boxes with the incorrect hostname >> and >> system tag (scribe1b is the other rsyslog box, the one showing the >> problem) >> >> # tail scribe1* >> ==> scribe1a-b <== >> <22>Jul 31 21:39:21 192.168.242.126 smelter v0.88.5[23535]: >> n714dL7N010869: >> unable to open S/MIME certificate >> '/var/spool/certs/chris.cournoyer at digitalinsight.com' >> >> 192.168.210.217 192.168.242.126 smelter >> >> >> <22>Jul 31 21:39:21 192.168.242.126 smelter v0.88.5[23535]: >> n714dL7N010869: >> unable to add rcpt 'chris.cournoyer at digitalinsight.com' :: bad >> certificate >> >> 192.168.210.217 192.168.242.126 smelter > > > mmhhh... if "smelter" is the tag, the issue is that fromhost is ending in 217 > but the hostname reported is 126? If so, may this box be multihomed? rsyslog > (should ;)) use simply what is provided to it, and it looks like that was > .126 (to be shown ;)). But first things first: is my understanding of the > failure scenario correct? all the 192.168.210.x servers are relays. the template is $template raw,"%rawmsg%\n%fromhost% %hostname% %syslogtag%\n\n\n" so it displays the raw message, then fromhost, hostname, syslogtag the scribe1a messages you quoted here are showing the right thing. 
these messages from scribe1b-p however do not

<29>Jul 31 21:39:21 methane1e-b plug-gw[10538]: disconnect host=/192.168.242.211 destination=179.50.100.127/11282 in=3274 out=1448 duration=0

192.168.210.219 192.168.210.219 methane1e-b

as far as I can tell, this is a properly formatted message relayed from methane1e-b by scribe1b-p (192.168.210.219 running rsyslog), but after being parsed it puts the hostname from the message in the syslog tag and puts the scribe1b-p ip address in the hostname

the second problem from my initial e-mail (and the one I mentioned in response to the 5.1.4 release) is pointed out by this portion of my initial e-mail

<29>Jul 31 21:33:39 methane1d-b plug-gw[13212]: connect host= /192.168.243.38 destination=179.50.100.130/11074

192.168.210.245 192.168.210.245 methane1d-b

in addition to not parsing the message correctly and putting the hostname in the syslogtag field, the fromhost is incorrect. this message could only have gotten here by being relayed from the .219 box. the log file on the .245 box (which logs *.* to messages) doesn't show anything like this, and the methane1d-b box doesn't have any networks in common with the .245 box

David Lang

From david at lang.hm Thu Aug 20 11:51:04 2009
From: david at lang.hm (david at lang.hm)
Date: Thu, 20 Aug 2009 02:51:04 -0700 (PDT)
Subject: [rsyslog] more 5.1.3 errors (fwd)
In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD42@GRFEXC.intern.adiscon.com>
References: <9B6E2A8877C38245BFB15CC491A11DA706FD42@GRFEXC.intern.adiscon.com>
Message-ID:

On Thu, 20 Aug 2009, Rainer Gerhards wrote:

> David,
>
> some more comments...
>
>> <175>Aug 1 00:39:22 172.20.254.6 ^A
>> MSWinEventLog^I1^ISecurity^I343780121^IFri
>> Jul 31 18:20:25 2009^I538^ISecurity^Idataman^IUser^ISuccess
>> Audit^IOPSMON01^ILogon/Logoff^I^Idataman^I343777243
>>
>> 192.168.210.245 172.20.254.6 ^A
>
> provided that 192.168.210.245 is the correct sender address as seen by the
> receiver (NAT?), this message looks good (^A actually is the tag, even though
> the sender has probably not intended it to be a tag...)

yes it is.

in retrospect I should have only shown bad messages. unfortunately I just did a tail of all of the files, checked that they included errors and sent them all. part of my reasoning was to show that I can't see any difference between messages that work and ones that don't.

>>
>>
>>
>> an example of the second problem is log entries like this
>>
>> <29>Jul 31 21:33:39 methane1d-b plug-gw[13212]: connect host=
>> /192.168.243.38
>> destination=179.50.100.130/11074
>>
>> 192.168.210.245 192.168.210.245 methane1d-b
>>
>>
>> the problem is that the log file on the .245 box (which log *.* to
>> messages)
>> don't show anything like this, and the methane1d-b box doesn't have any
>> networks in common with the .245 box
>
> I don't get any grip on this. Would it be possible to provide (privately)
> debug log files for this processing? I have really a hard time figuring out
> what's going on there, and I am not sure if some unprintable characters are
> part of the picture. Only the debug log will show me that...

I will see what I can do. I'm home sick for the rest of the week, so I don't know how much I'll be able to test anything.
David Lang

From rgerhards at hq.adiscon.com Thu Aug 20 13:04:46 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Thu, 20 Aug 2009 13:04:46 +0200
Subject: [rsyslog] more 5.1.3 errors (fwd)
References: <9B6E2A8877C38245BFB15CC491A11DA706FD42@GRFEXC.intern.adiscon.com>
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD44@GRFEXC.intern.adiscon.com>

> > I don't get any grip on this. Would it be possible to provide (privately)
> > debug log files for this processing? I have really a hard time figuring out
> > what's going on there, and I am not sure if some unprintable characters are
> > part of the picture. Only the debug log will show me that...
>
> I will see what I can do. I'm home sick for the rest of the week, so I
> don't know how much I'll be able to test anything.

I am sad to hear this. But I think there is no need to hurry so much. I'd expect to get a firm understanding of the situation as soon as I get the debug logs, and then it should not be hard to produce a fix (if things don't turn out to be really crazy ;)). So if you can send them next week, I'd hope that we get a fix within a day or two.

In the meantime, I'll look at the other information you provided and see if I can reproduce the behavior.
Rainer

From rgerhards at hq.adiscon.com Thu Aug 20 13:15:19 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Thu, 20 Aug 2009 13:15:19 +0200
Subject: [rsyslog] more 5.1.3 errors (fwd)
References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com>
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD47@GRFEXC.intern.adiscon.com>

ah, hold on, I may be able to reproduce an issue with one of the messages flagged bad :)

From rgerhards at hq.adiscon.com Thu Aug 20 14:11:51 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Thu, 20 Aug 2009 14:11:51 +0200
Subject: [rsyslog] more 5.1.3 errors (fwd)
References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD47@GRFEXC.intern.adiscon.com>
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD48@GRFEXC.intern.adiscon.com>

I am now almost 100% sure it is a regression from this change:

http://git.adiscon.com/?p=rsyslog.git;a=commitdiff;h=86e37f70fe0e9de0e00362990c73536843c8fef3

As it looks, I forgot to add the dash as a permitted character in hostnames, thus it triggers the logic that says "invalid hostname, so it must be a tag".

Will see what I need to fix...
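The failure mode Rainer describes above can be illustrated with a small sketch. This is written in Python rather than rsyslog's C and is not the actual parser code: it assumes a legacy header of the form `<PRI>Mmm dd hh:mm:ss HOST TAG: msg`, and the names `parse_legacy`, `PERMITTED_BUGGY`, and `PERMITTED_FIXED` (as well as the exact character sets) are hypothetical. The point is only that once the dash is missing from the permitted set, the token after the timestamp is taken as the tag and the hostname falls back to the sender address.

```python
import re
import string

# Hypothetical character sets: the "buggy" one omits the dash.
PERMITTED_BUGGY = set(string.ascii_letters + string.digits + "._")
PERMITTED_FIXED = PERMITTED_BUGGY | {"-"}

# <PRI>, a "Mmm dd hh:mm:ss" timestamp, then the rest of the line.
HEADER = re.compile(r"^<(\d{1,3})>([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_legacy(rawmsg, fromhost, permitted):
    """Return (fromhost, hostname, syslogtag) for one raw message."""
    match = HEADER.match(rawmsg)
    if match is None:
        raise ValueError("not a legacy syslog line")
    tokens = match.group(3).split(" ")
    first = tokens[0]
    if first and all(ch in permitted for ch in first):
        # Valid hostname: the following token is the tag.
        hostname = first
        tag = tokens[1] if len(tokens) > 1 else ""
    else:
        # "Invalid hostname, so it must be a tag": fall back to the sender.
        hostname = fromhost
        tag = first
    return fromhost, hostname, tag


RAW = ("<29>Jul 31 21:39:21 methane1e-b plug-gw[10538]: disconnect "
       "host=/192.168.242.211 destination=179.50.100.127/11282")

buggy = parse_legacy(RAW, "192.168.210.219", PERMITTED_BUGGY)
fixed = parse_legacy(RAW, "192.168.210.219", PERMITTED_FIXED)
print(" ".join(buggy))
print(" ".join(fixed))
```

With the dash missing, the sketch yields `192.168.210.219 192.168.210.219 methane1e-b`, the same fromhost/hostname/syslogtag line David reported for scribe1b-p; with the dash permitted, the hostname is `methane1e-b` and the tag is `plug-gw[10538]:`.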
Rainer

From david at lang.hm Thu Aug 20 14:15:42 2009
From: david at lang.hm (david at lang.hm)
Date: Thu, 20 Aug 2009 05:15:42 -0700 (PDT)
Subject: [rsyslog] more 5.1.3 errors (fwd)
In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD48@GRFEXC.intern.adiscon.com>
References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD47@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD48@GRFEXC.intern.adiscon.com>
Message-ID:

On Thu, 20 Aug 2009, Rainer Gerhards wrote:

> I am now almost 100% sure it is a regression from this change:
>
> http://git.adiscon.com/?p=rsyslog.git;a=commitdiff;h=86e37f70fe0e9de0e00362990c73536843c8fef3
>
> As it looks, I forgot to add the dash as a permitted character in hostnames,
> thus it triggers the logic that says "invalid hostname, so it must be a tag".
>
> Will see what I need to fix...

that would definitely cause me problems (_lots_ of my hostnames have a dash in them)

it doesn't explain the second issue where logs that are relayed through one machine end up showing that they were relayed through a different one.
David Lang

From rgerhards at hq.adiscon.com Thu Aug 20 14:30:16 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Thu, 20 Aug 2009 14:30:16 +0200
Subject: [rsyslog] more 5.1.3 errors (fwd)
References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD47@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD48@GRFEXC.intern.adiscon.com>
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD49@GRFEXC.intern.adiscon.com>

> -----Original Message-----
> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-bounces at lists.adiscon.com] On Behalf Of david at lang.hm
> Sent: Thursday, August 20, 2009 2:16 PM
> To: rsyslog-users
> Subject: Re: [rsyslog] more 5.1.3 errors (fwd)
>
> On Thu, 20 Aug 2009, Rainer Gerhards wrote:
>
> > I am now almost 100% sure it is a regression from this change:
> >
> > http://git.adiscon.com/?p=rsyslog.git;a=commitdiff;h=86e37f70fe0e9de0e00362990c73536843c8fef3
> >
> > As it looks, I forgot to add the dash as a permitted character in hostnames,
> > thus it triggers the logic that says "invalid hostname, so it must be a tag".
> >
> > Will see what I need to fix...
>
> that would definitely cause me problems (_lots_ of my hostnames have a
> dash in them)

Yes, that was the cause of this problem. I have now fixed it in all affected versions (inside the git branches, of course). The trivial patch is here:

http://git.adiscon.com/?p=rsyslog.git;a=commitdiff;h=daa76ad94428599336ddafdd6854dc0b71356180

> it doesn't explain the second issue where logs that are relayed through
> one machine end up showing that they were relayed through a different
> one.

One at a time ;) I'll try to find out more about this, but I fear here I really need a debug log. I suspect this is something that is triggered by the environment...

Rainer

From rgerhards at hq.adiscon.com Thu Aug 20 14:48:40 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Thu, 20 Aug 2009 14:48:40 +0200
Subject: [rsyslog] more 5.1.3 errors (fwd) / invalid fromhost
References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com>
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD4B@GRFEXC.intern.adiscon.com>

> the second problem from my initial e-mail (and the one I mentioned in
> response to the 5.1.4 release) is pointed out by this portion of my
> initial e-mail
>
> <29>Jul 31 21:33:39 methane1d-b plug-gw[13212]: connect host=
> /192.168.243.38 destination=179.50.100.130/11074
>
> 192.168.210.245 192.168.210.245 methane1d-b
>
>
> in addition to not parsing the message correctly and putting the hostname
> in the syslogtag field, the fromhost is incorrect. this message could only
> have gotten here by being relayed from the .219 box. the log file on the
> .245 box (which logs *.* to messages) doesn't show anything like this, and
> the methane1d-b box doesn't have any networks in common with the .245 box

Is there any possibility that *any* message was previously received from .245? From what you said, I assume not.
I am asking because the problem may be related to the name lookup reuse technique (which re-uses the previous lookup result if the sender IP is the same as with the last UDP packet). However, the problem can only be rooted in that area if .245 is a valid sender at least at times.

I have also done a source code review right now, but I do not see anything suspicious. Will continue to try a little bit, but that one does probably need to wait until we can do some debugging in your environment.

Rainer

From oribani at gmail.com Fri Aug 21 02:38:19 2009
From: oribani at gmail.com (Ori Bani)
Date: Thu, 20 Aug 2009 17:38:19 -0700
Subject: [rsyslog] Arbitrary string replacements
Message-ID: <378058110908201738h689bb025j2e95850e5cdff822@mail.gmail.com>

Hi,

I understand that arbitrary replacements on log messages are not supported by rsyslog. I found a thread that explains it here:

http://lists.adiscon.net/pipermail/rsyslog/2009-June/002317.html

I'd like to give my vote for adding this feature. I have the same requirement (or similar) to the OP of that thread. For now, I have to use syslog-ng, which I understand has recently already implemented this feature, or if I want to use rsyslog, I have to drop (discard) the messages that have information that I am not allowed to keep in my logs (those with IP addresses):

# This discards any message with an IP (ver. 4) address in it
# (each octet is one or more digits)
:msg, regex, "[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*" ~

From oribani at gmail.com Fri Aug 21 02:53:41 2009
From: oribani at gmail.com (Ori Bani)
Date: Thu, 20 Aug 2009 17:53:41 -0700
Subject: [rsyslog] Need help with RPM(yum) version on CentOS
Message-ID: <378058110908201753v41c58b4fx401efda639d058e4@mail.gmail.com>

Hi,

I'm sorry if this isn't quite the right place to ask, since maybe no one here created the RPM that's in the CentOS base repository. But I am guessing people here have installed RPMs like this before and can help anyway....
When I ask yum on CentOS 5 about rsyslog, I get this (note older version - too bad): Available Packages Name : rsyslog Arch : i386 Version: 2.0.6 Release: 1.el5 Size : 198 k Repo : base Summary: Enhanced system logging and kernel message trapping daemons Description: Rsyslog is an enhanced multi-threaded syslogd supporting, among others, MySQL, syslog/tcp, RFC 3195, permitted sender lists, filtering on any message part, and fine grain output format control. It is quite compatible to stock sysklogd and can be used as a drop-in replacement. Its advanced features make it suitable for enterprise-class, encryption protected syslog relay chains while at the same time being very easy to setup for the novice user. My questions are a little bit newbie... before I try installing this, I want to know what it's going to do to my system: 1) Will it disable syslogd and/or klogd? Or will it add itself using the "alternatives" paradigm so I can switch between them that way? If neither, does it include startup scripts at all? If they are there but not used by default, is there a recommended way to make the switch and not really screw things up? 2) Will it add itself to my cron jobs? Specifically, I don't mind (for now) leaving the log rotation alone (don't let rsyslog manage my rotations). If it adds itself to my cron jobs, does that mean it will remove the logrotate cron job? 2.5) If I keep using the old logrotate with rsyslog, will that create any conflicts? Generally my aim is not to commit 100% to rsyslog yet, so I don't want to get to a situation where it's a lot of work to get back to the default syslog setup. 
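
A side note on the discard filter from the "Arbitrary string replacements" message above: each [0-9] in that pattern matches exactly one digit, so the filter only seems to fire when the two middle octets of an address are single-digit. The sketch below illustrates this with Python's re module, which is only a stand-in here; rsyslog's regex comparison uses POSIX regular expressions, and the sample lines and the wider pattern are assumptions for illustration, not a tested rsyslog config.

```python
import re

# Pattern as posted: one digit per character class.
ORIGINAL = re.compile(r"[0-9]\.[0-9]\.[0-9]\.[0-9]")

# Hypothetical wider pattern: one or more digits per octet.
WIDER = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")

# Sample lines; the first is modeled on a message quoted in this thread.
samples = [
    "connect host=/192.168.243.38 destination=179.50.100.130/11074",
    "relay from 10.0.0.1 accepted",
    "session closed for user root",
]

original_hits = [bool(ORIGINAL.search(s)) for s in samples]
wider_hits = [bool(WIDER.search(s)) for s in samples]

print(original_hits)
print(wider_hits)
```

In POSIX basic regex syntax, "one or more digits" would usually be spelled [0-9][0-9]* rather than [0-9]+. Note that the wider pattern also matches dotted version strings such as 1.2.3.4, so it may discard more than intended.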
From mic at npgx.com.au Fri Aug 21 04:08:30 2009
From: mic at npgx.com.au (Michael Mansour)
Date: Fri, 21 Aug 2009 13:08:30 +1100
Subject: [rsyslog] Need help with RPM(yum) version on CentOS
In-Reply-To: <378058110908201753v41c58b4fx401efda639d058e4@mail.gmail.com>
References: <378058110908201753v41c58b4fx401efda639d058e4@mail.gmail.com>
Message-ID: <20090821015920.M76525@npgx.com.au>

Hi Ori,

> Hi,
>
> I'm sorry if this isn't quite the right place to ask, since maybe no
> one here created the RPM that's in the CentOS base repository. But I
> am guessing people here have installed RPMs like this before and can
> help anyway....
>
> When I ask yum on CentOS 5 about rsyslog, I get this (note older
> version - too bad):
>
> Available Packages
> Name   : rsyslog
> Arch   : i386
> Version: 2.0.6
> Release: 1.el5
> Size   : 198 k
> Repo   : base
> Summary: Enhanced system logging and kernel message trapping daemons
> Description:
> Rsyslog is an enhanced multi-threaded syslogd supporting, among
> others, MySQL, syslog/tcp, RFC 3195, permitted sender lists,
> filtering on any message part, and fine grain output format control.
> It is quite compatible to stock sysklogd and can be used as a drop-in
> replacement. Its advanced features make it suitable for
> enterprise-class, encryption protected syslog relay chains while at
> the same time being very easy to setup for the novice user.

I use Scientific Linux 5.x and because they are RHEL derivatives I see the
same thing in the SL repos. I haven't used the rsyslog from the repos yet,
all my rsyslog servers are based on EL4, but I'll try to help below.

> My questions are a little bit newbie... before I try installing
> this, I want to know what it's going to do to my system:
>
> 1) Will it disable syslogd and/or klogd? Or will it add itself using
> the "alternatives" paradigm so I can switch between them that way?
> If neither, does it include startup scripts at all?
> If they are there
> but not used by default, is there a recommended way to make the
> switch and not really screw things up?

You should try this on a test box. I haven't tried it but I think it should
remove syslog RPMs from your installation and then install rsyslog. It should
also make a /etc/syslog.conf.rpmsave file which you can reference for use in
/etc/rsyslog.conf

> 2) Will it add itself to my cron jobs? Specifically, I don't mind
> (for now) leaving the log rotation alone (don't let rsyslog manage my
> rotations). If it adds itself to my cron jobs, does that mean it
> will remove the logrotate cron job?

Not sure, sorry. You should grab the src.rpm file from CentOS, install it and
take a look at the rsyslog.spec and it'll show you what it does in the post
install section.

> 2.5) If I keep using the old logrotate with rsyslog, will that create
> any conflicts?

I don't see how any conflicts will occur with logrotate, since rsyslog
basically logs to the same files that syslog logs to. It's meant to be a
drop-in replacement.

Maybe specific questions about rsyslog with CentOS (or other derivatives)
would actually be better in the CentOS or Scientific Linux mailing lists?

Michael.

> Generally my aim is not to commit 100% to rsyslog yet, so I don't
> want to get to a situation where it's a lot of work to get back to
> the default syslog setup.
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com
------- End of Original Message -------

From tbergfeld at hq.adiscon.com Fri Aug 21 11:09:26 2009
From: tbergfeld at hq.adiscon.com (Tom Bergfeld)
Date: Fri, 21 Aug 2009 11:09:26 +0200
Subject: [rsyslog] rsyslog 4.4.0 (v4-stable) released
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD53@GRFEXC.intern.adiscon.com>

Hi all,

We have just released rsyslog 4.4.0. This is the next incarnation of the
v4-stable branch, bringing all features available in the current v4-beta
(4.3.2) as well as some additional fixes. Be sure to review the 4.3.x change
logs to see all new features included in this release.

This is a recommended update for all users of the v4-stable branch.

Download:
http://www.rsyslog.com/Downloads-req-viewdownloaddetails-lid-171.phtml

Changelog:
http://www.rsyslog.com/Article394.phtml

As always, feedback is appreciated.

Best regards,
Tom Bergfeld

--
Support
=======
Improving rsyslog is costly, but you can help! We are looking for
organizations that find rsyslog useful and wish to contribute back. You can
contribute by reporting bugs, improving the software, or donating money or
equipment. Commercial support contracts for rsyslog are available, and they
help finance continued maintenance.

Adiscon GmbH, a privately held German company, is currently funding rsyslog
development. We are always looking for interesting development projects. For
details on how to help, please see http://www.rsyslog.com/doc-how2help.html .

From tbergfeld at hq.adiscon.com Fri Aug 21 11:13:20 2009
From: tbergfeld at hq.adiscon.com (Tom Bergfeld)
Date: Fri, 21 Aug 2009 11:13:20 +0200
Subject: [rsyslog] rsyslog 4.5.2 (beta) released
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD54@GRFEXC.intern.adiscon.com>

Hi all,

We have just released rsyslog 4.5.2. This begins a new v4-beta version of
rsyslog.
It offers all the new features from the v4-development branch in a
now-stabilizing branch. Most importantly, these are the ability to write log
files in gzip format as well as performance enhancements.

Please note that this version contains some fixes not yet found in v4-devel,
so it supersedes v4-devel. There is currently no v4-devel version. A new
v4-devel version will be created as need arises. The majority of new features
will go into v5-devel.

Download:
http://www.rsyslog.com/Downloads-req-viewdownloaddetails-lid-172.phtml

Changelog:
http://www.rsyslog.com/Article395.phtml

As always, feedback is appreciated.

Best regards,
Tom Bergfeld
From david at lang.hm Fri Aug 21 12:33:50 2009 From: david at lang.hm (david at lang.hm) Date: Fri, 21 Aug 2009 03:33:50 -0700 (PDT) Subject: [rsyslog] more 5.1.3 errors (fwd) / invalid fromhost In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD4B@GRFEXC.intern.adiscon.com> References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD4B@GRFEXC.intern.adiscon.com> Message-ID: On Thu, 20 Aug 2009, Rainer Gerhards wrote: >> the second problem from my initial e-mail (and the one I mentioned in >> response to the 5.1.4 release) is pointed out by this portion of my >> initial e-mail >> >> <29>Jul 31 21:33:39 methane1d-b plug-gw[13212]: connect host= >> /192.168.243.38 destination=179.50.100.130/11074 >> >> 192.168.210.245 192.168.210.245 methane1d-b >> >> >> in addition to not parsing the message correctly and putting the >> hostnmae >> in the syslogtag field, the fromhost is incorrect. this message could >> only >> have gotten here by being relayed from the .219 box. the log file on >> the >> .245 box (which logs *.* to messages) don't show anything like this, >> and >> the methane1d-b box doesn't have any networks in common with the .245 >> box > > Is there any possibility that *any* message was previously received from > .245? From what you said, I assume not. I am asking because the problem may > be related to the name lookup reuse technique (which re-uses the sender IP if > it is the same as with the last UDP packet). However, the problem can only be > rooted in that area if .245 is a valid sender at least at times. 
yes, .245 is a valid sender (in fact, in the main set of messages I sent
initially all the messages from .245 are legit)

all of the sources listed in the config do send messages (for each pair,
one is the active relay sending hundreds to thousands of messages/sec
while the other is the backup, sending a handful per min)

> I have also done a source code review right now, but I do not see anything
> suspicious. Will continue to try a little bit, but that one does probably
> need to wait until we can do some debugging in your environment.

understood. I may try to go in over the weekend and set up this sort of
test.

David Lang

> Rainer

From rgerhards at hq.adiscon.com Fri Aug 21 12:54:03 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Fri, 21 Aug 2009 12:54:03 +0200
Subject: [rsyslog] more 5.1.3 errors (fwd) / invalid fromhost
References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD4B@GRFEXC.intern.adiscon.com>
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD59@GRFEXC.intern.adiscon.com>

> >> the second problem from my initial e-mail (and the one I mentioned in
> >> response to the 5.1.4 release) is pointed out by this portion of my
> >> initial e-mail
> >>
> >> <29>Jul 31 21:33:39 methane1d-b plug-gw[13212]: connect host=
> >> /192.168.243.38 destination=179.50.100.130/11074
> >>
> >> 192.168.210.245 192.168.210.245 methane1d-b
> >>
> >>
> >> in addition to not parsing the message correctly and putting the hostname
> >> in the syslogtag field, the fromhost is incorrect. this message could only
> >> have gotten here by being relayed from the .219 box.
the log file on > >> the > >> .245 box (which logs *.* to messages) don't show anything like this, > >> and > >> the methane1d-b box doesn't have any networks in common with the > .245 > >> box > > > > Is there any possibility that *any* message was previously received > from > > .245? From what you said, I assume not. I am asking because the > problem may > > be related to the name lookup reuse technique (which re-uses the > sender IP if > > it is the same as with the last UDP packet). However, the problem can > only be > > rooted in that area if .245 is a valid sender at least at times. > > yes, .245 is a valid sender (in fact, in the main set of mssages I sent > initilly all the messages from .245 are legit) > > all of the sources listed in the config do send messages (for each > pair, > one is the active relay sending hundreds to thousands of messages/sec > while the other is the backup, sending a handful per min) OK, this looks like it narrows down the code to look at to a relatively small portion. Will give that another review. > > I have also done a source code review right now, but I do not see any > > suspicious. Will continue to try a little bit, but that one does > probably > > need to wait until we can do some debugging in your environment. > > understood. I may try to go in over the weekend and setup this sort of > test. Do you think this can be easily reproduced? If so, that would be great. I could simply comment out some of the code I suspect to cause the bug and so we could check both versions and see if it makes a difference... 
Rainer From david at lang.hm Fri Aug 21 12:56:36 2009 From: david at lang.hm (david at lang.hm) Date: Fri, 21 Aug 2009 03:56:36 -0700 (PDT) Subject: [rsyslog] more 5.1.3 errors (fwd) / invalid fromhost In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD59@GRFEXC.intern.adiscon.com> References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD4B@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD59@GRFEXC.intern.adiscon.com> Message-ID: On Fri, 21 Aug 2009, Rainer Gerhards wrote: >>>> the second problem from my initial e-mail (and the one I mentioned >> in >>>> response to the 5.1.4 release) is pointed out by this portion of my >>>> initial e-mail >>>> >>>> <29>Jul 31 21:33:39 methane1d-b plug-gw[13212]: connect host= >>>> /192.168.243.38 destination=179.50.100.130/11074 >>>> >>>> 192.168.210.245 192.168.210.245 methane1d-b >>>> >>>> >>>> in addition to not parsing the message correctly and putting the >>>> hostnmae >>>> in the syslogtag field, the fromhost is incorrect. this message >> could >>>> only >>>> have gotten here by being relayed from the .219 box. the log file on >>>> the >>>> .245 box (which logs *.* to messages) don't show anything like this, >>>> and >>>> the methane1d-b box doesn't have any networks in common with the >> .245 >>>> box >>> >>> Is there any possibility that *any* message was previously received >> from >>> .245? From what you said, I assume not. I am asking because the >> problem may >>> be related to the name lookup reuse technique (which re-uses the >> sender IP if >>> it is the same as with the last UDP packet). However, the problem can >> only be >>> rooted in that area if .245 is a valid sender at least at times. 
>> >> yes, .245 is a valid sender (in fact, in the main set of mssages I sent >> initilly all the messages from .245 are legit) >> >> all of the sources listed in the config do send messages (for each >> pair, >> one is the active relay sending hundreds to thousands of messages/sec >> while the other is the backup, sending a handful per min) > > OK, this looks like it narrows down the code to look at to a relatively small > portion. Will give that another review. > >>> I have also done a source code review right now, but I do not see any >>> suspicious. Will continue to try a little bit, but that one does >> probably >>> need to wait until we can do some debugging in your environment. >> >> understood. I may try to go in over the weekend and setup this sort of >> test. > > Do you think this can be easily reproduced? If so, that would be great. I > could simply comment out some of the code I suspect to cause the bug and so > we could check both versions and see if it makes a difference... yes, I can compile and install a new version and run it for a few min. this problem seems to show up fairly rapidly. David Lang From rgerhards at hq.adiscon.com Fri Aug 21 12:59:07 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Fri, 21 Aug 2009 12:59:07 +0200 Subject: [rsyslog] more 5.1.3 errors (fwd) / invalid fromhost References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD4B@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD59@GRFEXC.intern.adiscon.com> Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD5A@GRFEXC.intern.adiscon.com> > > Do you think this can be easily reproduced? If so, that would be > great. I > > could simply comment out some of the code I suspect to cause the bug > and so > > we could check both versions and see if it makes a difference... > > yes, I can compile and install a new version and run it for a few min. > this problem seems to show up fairly rapidly. 
OK, that's great :)

So I would suggest this course of actions:

1. get a debug log exposing the problem
   (chances are good, 50:50, it will pinpoint the issue)

if we can not see what causes it:

2. create special version without the code I suspect

Then look at the results and see what to do...

Is that OK with you?

Rainer

From david at lang.hm Fri Aug 21 13:28:45 2009
From: david at lang.hm (david at lang.hm)
Date: Fri, 21 Aug 2009 04:28:45 -0700 (PDT)
Subject: [rsyslog] more 5.1.3 errors (fwd) / invalid fromhost
In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD5A@GRFEXC.intern.adiscon.com>
References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD4B@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD59@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD5A@GRFEXC.intern.adiscon.com>
Message-ID:

On Fri, 21 Aug 2009, Rainer Gerhards wrote:

>>> Do you think this can be easily reproduced? If so, that would be great. I
>>> could simply comment out some of the code I suspect to cause the bug and so
>>> we could check both versions and see if it makes a difference...
>>
>> yes, I can compile and install a new version and run it for a few min.
>> this problem seems to show up fairly rapidly.
>
> OK, that's great :)
>
> So I would suggest this course of actions:
>
> 1. get a debug log exposing the problem
>    (chances are good, 50:50, it will pinpoint the issue)
>
> if we can not see what causes it:
>
> 2. create special version without the code I suspect
>
> Then look at the results and see what to do...
>
> Is that OK with you?

sounds good. do you have a patch (or a commit to revert) for step 2?
David Lang From rgerhards at hq.adiscon.com Fri Aug 21 14:00:55 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Fri, 21 Aug 2009 14:00:55 +0200 Subject: [rsyslog] more 5.1.3 errors (fwd) / invalid fromhost References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD4B@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD59@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD5A@GRFEXC.intern.adiscon.com> Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD5B@GRFEXC.intern.adiscon.com> > -----Original Message----- > From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- > bounces at lists.adiscon.com] On Behalf Of david at lang.hm > Sent: Friday, August 21, 2009 1:29 PM > To: rsyslog-users > Subject: Re: [rsyslog] more 5.1.3 errors (fwd) / invalid fromhost > > On Fri, 21 Aug 2009, Rainer Gerhards wrote: > > >>> Do you think this can be easily reproduced? If so, that would be > >> great. I > >>> could simply comment out some of the code I suspect to cause the > bug > >> and so > >>> we could check both versions and see if it makes a difference... > >> > >> yes, I can compile and install a new version and run it for a few > min. > >> this problem seems to show up fairly rapidly. > > > > OK, that's great :) > > > > So I would suggest this course of actions: > > > > 1. get a debug log exposing the problem > > (chances are good, 50:50, it will pinpoint the issue) > > > > if we can not see what causes it: > > > > 2. create special version without the code I suspect > > > > Then look at the results and see what to do... > > > > Is that OK with you? > > sounds good. do you have a patch (or a commit to revert) for step 2? not yet, will see if I can create one today, but it may be useful to have the debug log (step 1) first. I understand that you would like to do both tests at once, but it depends a bit on how easy it is to "comment out" the code (I think it is easy). 
Will post an update later today.

Rainer

From rgerhards at hq.adiscon.com Fri Aug 21 15:20:13 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Fri, 21 Aug 2009 15:20:13 +0200
Subject: [rsyslog] more 5.1.3 errors (fwd) / invalid fromhost
References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD4B@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD59@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD5A@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD5B@GRFEXC.intern.adiscon.com>
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD5D@GRFEXC.intern.adiscon.com>

> > sounds good. do you have a patch (or a commit to revert) for step 2?
>
> not yet, will see if I can create one today, but it may be useful to have the
> debug log (step 1) first. I understand that you would like to do both tests
> at once, but it depends a bit on how easy it is to "comment out" the code (I
> think it is easy). Will post an update later today.

David,

I think I have found the bug :) It was one of those that you actually
overlook while reviewing code; creating the test branches helped. I used an
"and" where an "or" was required in a predicate check, thus strings were
always re-used if the size of the former and the current string matched. That
would very well explain what you saw (the host IPs were of equal length). In
any case, it is a bug, and it is fixed in the master branch:

http://git.adiscon.com/?p=rsyslog.git;a=commitdiff;h=cdb58f8d913dc47b01f61f5a72a83ce6aea26623

Just in case it should not address what you see, I have created two testing
branches for you: these are "david-test2a" and "david-test2b". They disable
different parts of the reuse logic (while crafting 2b I finally saw the
issue...).

Feedback appreciated.
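
The and/or mix-up described above can be sketched as follows. This is only an illustration of that class of bug in Python, not rsyslog's actual C code: the function names and the reuse loop are hypothetical, and the addresses are taken from the config earlier in the thread.

```python
def must_replace_buggy(prev: str, cur: str) -> bool:
    # Buggy predicate: with "and", two different strings of equal length
    # are never treated as different, so the old string gets re-used.
    return len(prev) != len(cur) and prev != cur


def must_replace_fixed(prev: str, cur: str) -> bool:
    # Fixed predicate: a difference in length OR in content forces a
    # fresh string.
    return len(prev) != len(cur) or prev != cur


def record_fromhost(senders, must_replace):
    """Return the fromhost recorded per packet when the previous string
    is re-used unless must_replace() says otherwise."""
    recorded = []
    prev = None
    for ip in senders:
        if prev is None or must_replace(prev, ip):
            prev = ip  # store a new string for this sender
        recorded.append(prev)
    return recorded


# Hypothetical packet order; all three addresses have the same length.
senders = ["192.168.210.245", "192.168.210.219", "192.168.210.216"]

buggy = record_fromhost(senders, must_replace_buggy)
fixed = record_fromhost(senders, must_replace_fixed)

print(buggy)
print(fixed)
```

With the buggy predicate every packet in this sketch is attributed to .245, the first sender seen, which resembles the symptom of relayed traffic showing up with a .245 fromhost. With the fixed predicate each packet keeps its own sender.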
Rainer

From rgerhards at hq.adiscon.com Fri Aug 21 18:12:54 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Fri, 21 Aug 2009 18:12:54 +0200
Subject: [rsyslog] Arbitrary string replacements
References: <378058110908201738h689bb025j2e95850e5cdff822@mail.gmail.com>
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD6E@GRFEXC.intern.adiscon.com>

can you elaborate a little on how you would like to use it? It still would be
a good idea to create its own feature request inside the bug tracker - I look
there if I have time to do new things, not so often in the mailing list
archive ;)

> -----Original Message-----
> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-bounces at lists.adiscon.com] On Behalf Of Ori Bani
> Sent: Friday, August 21, 2009 2:38 AM
> To: rsyslog at lists.adiscon.com
> Subject: [rsyslog] Arbitrary string replacements
>
> Hi,
>
> I understand that arbitrary replacements on log messages are not
> supported by rsyslog. I found a thread that explains it here:
>
> http://lists.adiscon.net/pipermail/rsyslog/2009-June/002317.html
>
> I'd like to give my vote for adding this feature. I have the same
> requirement (or similar) to the OP of that thread. For now, I have to
> use syslog-ng, which I understand has recently already implemented
> this feature, or if I want to use rsyslog, I have to drop (discard)
> the messages that have information that I am not allowed to keep in my
> logs (those with IP addresses):
>
> # This discards any message with an IP (ver. 4) address in it
> :msg, regex, "[0-9]\.[0-9]\.[0-9]\.[0-9]" ~

From oribani at gmail.com Sat Aug 22 03:45:45 2009
From: oribani at gmail.com (Ori Bani)
Date: Fri, 21 Aug 2009 18:45:45 -0700
Subject: [rsyslog] Arbitrary string replacements
In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD6E@GRFEXC.intern.adiscon.com>
References: <378058110908201738h689bb025j2e95850e5cdff822@mail.gmail.com> <9B6E2A8877C38245BFB15CC491A11DA706FD6E@GRFEXC.intern.adiscon.com>
Message-ID: <378058110908211845l1cd1d2bbr3e5a4f3bb866267@mail.gmail.com>

On 8/21/09, Rainer Gerhards wrote:
> can you elaborate a little on how you would like to use it? It still would be
> a good idea to create its own feature request inside the bug tracker - I look
> there if I have time to do new things, not so often in the mailing list
> archive ;)

From my understanding it would basically be an extension to the regex
functionality in the property replacer. You already have submatch
numbers and all that, but you only allow the string that's used to be
a sub-string (submatch), and what I need is the ability to provide a
custom pattern replacement. If you did that, you could eliminate the
really confusing list of submatch number and match number (and
"nomatch"?), because those would be specified in the replacement
pattern (hopefully as $1, $2, etc.) -- just like any other regular
expression pattern replacement operation in many tools and languages.

Thanks!

>> I understand that arbitrary replacements on log messages are not
>> supported by rsyslog. I found a thread that explains it here:
>>
>> http://lists.adiscon.net/pipermail/rsyslog/2009-June/002317.html
>>
>> I'd like to give my vote for adding this feature. I have the same
>> requirement (or similar) to the OP of that thread.
>> For now, I have to
>> use syslog-ng, which I understand has recently already implemented
>> this feature, or if I want to use rsyslog, I have to drop (discard)
>> the messages that have information that I am not allowed to keep in my
>> logs (those with IP addresses):
>>
>> # This discards any message with an IP (ver. 4) address in it
>> :msg, regex, "[0-9]\.[0-9]\.[0-9]\.[0-9]" ~

From rsyslog-users at iotk.net Mon Aug 24 09:26:29 2009
From: rsyslog-users at iotk.net (VR)
Date: Mon, 24 Aug 2009 03:26:29 -0400
Subject: [rsyslog] Time of rotation?
Message-ID: <4A9240A5.4010909@iotk.net>

Hello,

I'm working with a debian (stable release 2.6.26-1-686) system and not
finding where rsyslog is rotating its /var/log/mail.* files from. I
thought it was via logrotate using cron but I'm coming up empty. Can
anyone provide some suggestions?

From rgerhards at hq.adiscon.com Mon Aug 24 11:10:49 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Mon, 24 Aug 2009 11:10:49 +0200
Subject: [rsyslog] Time of rotation?
References: <4A9240A5.4010909@iotk.net>
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD70@GRFEXC.intern.adiscon.com>

rsyslog does not do any rotation on its own (except if you use output
channels and explicitly define this). Usually, logrotate is used to rotate
files. So this looks like a problem with your config.

Rainer

> -----Original Message-----
> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-bounces at lists.adiscon.com] On Behalf Of VR
> Sent: Monday, August 24, 2009 9:26 AM
> To: rsyslog at lists.adiscon.com
> Subject: [rsyslog] Time of rotation?
>
> Hello,
>
> I'm working with a debian (stable release 2.6.26-1-686) system and not
> finding where rsyslog is rotating its /var/log/mail.* files from. I
> thought it was via logrotate using cron but I'm coming up empty. Can
> anyone provide some suggestions?
> _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com From dirk.schulz at kinzesberg.de Mon Aug 24 13:49:43 2009 From: dirk.schulz at kinzesberg.de (Dirk H. Schulz) Date: Mon, 24 Aug 2009 13:49:43 +0200 Subject: [rsyslog] Time of rotation? In-Reply-To: <4A9240A5.4010909@iotk.net> References: <4A9240A5.4010909@iotk.net> Message-ID: <4A927E57.1060003@kinzesberg.de> VR schrieb: > Hello, > > I'm working with a debian (stable release 2.6.26-1-686) system and not > finding where rsyslog is rotating its /var/log/mail.* files from. I > thought it was via logrotate using cron but I'm coming up empty. Did you check /etc/logrotate.conf AND /etc/logrotate.d/* ? Dirk From mbiebl at gmail.com Mon Aug 24 15:32:53 2009 From: mbiebl at gmail.com (Michael Biebl) Date: Mon, 24 Aug 2009 15:32:53 +0200 Subject: [rsyslog] Time of rotation? In-Reply-To: References: <4A9240A5.4010909@iotk.net> Message-ID: 2009/8/24 Michael Biebl : > 2009/8/24 VR : >> Hello, >> >> I'm working with a debian (stable release 2.6.26-1-686) system and not >> finding where rsyslog is rotating its /var/log/mail.* files from. I >> thought it was via logrotate using cron but I'm coming up empty. Can >> anyone provide some suggestions? > > /etc/logrotate.d/rsyslog > resp. > dpkg -L rsyslog | grep logrotate And about the exact time, when the rotation happens: See /etc/cron.daily/logrotate and grep daily /etc/crontab -- Why is it that all of the instruments seeking intelligent life in the universe are pointed away from Earth? From mbiebl at gmail.com Mon Aug 24 15:30:33 2009 From: mbiebl at gmail.com (Michael Biebl) Date: Mon, 24 Aug 2009 15:30:33 +0200 Subject: [rsyslog] Time of rotation? 
In-Reply-To: <4A9240A5.4010909@iotk.net> References: <4A9240A5.4010909@iotk.net> Message-ID: 2009/8/24 VR : > Hello, > > I'm working with a debian (stable release 2.6.26-1-686) system and not > finding where rsyslog is rotating its /var/log/mail.* files from. I > thought it was via logrotate using cron but I'm coming up empty. Can > anyone provide some suggestions? /etc/logrotate.d/rsyslog resp. dpkg -L rsyslog | grep logrotate Cheers, Michael -- Why is it that all of the instruments seeking intelligent life in the universe are pointed away from Earth? From david at lang.hm Mon Aug 24 22:52:04 2009 From: david at lang.hm (david at lang.hm) Date: Mon, 24 Aug 2009 13:52:04 -0700 (PDT) Subject: [rsyslog] more 5.1.3 errors (fwd) / invalid fromhost In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD5D@GRFEXC.intern.adiscon.com> References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD4B@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD59@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD5A@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD5B@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD5D@GRFEXC.intern.adiscon.com> Message-ID: On Fri, 21 Aug 2009, Rainer Gerhards wrote: > David, > > I think I have found the bug :) It was one of those that you actually > overlook while reviewing code, creating the test branches helped. I used an > "and" where an "or" war required in a predicate check, thus strings were > always re-used if the size of the former and the current string matched. That > would very well explain what you saw (the host IPs were of equal length). 
> In any case, it is a bug, and it is fixed in the master branch:
>
> http://git.adiscon.com/?p=rsyslog.git;a=commitdiff;h=cdb58f8d913dc47b01f61f5a72a83ce6aea26623
>
> Just in case it should not address what you see, I have created two testing
> branches for you: these are "david-test2a" and "david-test2b". They disable
> different parts of the reuse logic (while crafting 2b I finally saw the
> issue...).

I compiled and installed master (commit b0d76b2c) and it looks like it solved this problem.

I'm testing to see if it has the problem I reported with 4.2.1, where it dies under load from malformed messages.

David Lang

From david at lang.hm Mon Aug 24 23:06:47 2009
From: david at lang.hm (david at lang.hm)
Date: Mon, 24 Aug 2009 14:06:47 -0700 (PDT)
Subject: [rsyslog] more 5.1.3 errors (fwd) / invalid fromhost
In-Reply-To:
References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD4B@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD59@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD5A@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD5B@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD5D@GRFEXC.intern.adiscon.com>
Message-ID:

On Mon, 24 Aug 2009, david at lang.hm wrote:

> I compiled and installed master (commit b0d76b2c) and it looks like it solved
> this problem
>
> I'm testing to see if it has the problem I reported with 4.2.1 where it dies
> under load from malformed messages.

It finally died just like 4.2.1 did. It took a _lot_ longer, which may just be because the window for the race condition that causes the crash is smaller: 5.x is _significantly_ more efficient than 4.x. Processing ~1800 messages/sec, writing them locally and relaying them to another machine, eats up <2% cpu according to top.

I restarted it in debug mode (this takes more cpu, almost 10% of a cpu).

David Lang

From rgerhards at hq.adiscon.com Tue Aug 25 12:31:21 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Tue, 25 Aug 2009 12:31:21 +0200
Subject: [rsyslog] abort in 4.2.1
In-Reply-To:
References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD4B@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD59@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD5A@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD5B@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD5D@GRFEXC.intern.adiscon.com>
Message-ID: <1251196281.3225.10.camel@rgf11>

On Mon, 2009-08-24 at 14:06 -0700, david at lang.hm wrote:
> > I'm testing to see if it has the problem I reported with 4.2.1 where it dies
> > under load from malformed messages.
>
> It finally died just like 4.2.1 did.
> It took a _lot_ longer, which may just be because the window for the race
> condition that causes the crash is smaller: 5.x is _significantly_ more
> efficient than 4.x. Processing ~1800 messages/sec, writing them locally and
> relaying them to another machine, eats up <2% cpu according to top.
>
> I restarted it in debug mode (this takes more cpu, almost 10% of a cpu).

The bad thing about debug mode is that not only is it slower, it also introduces some synchronization. So race bugs frequently disappear when debug mode is turned on. Anyhow, sometimes they persist, and then the debug log often provides good information (aka "definitely worth a try" ;)).

I did some basic testing with the malformed message you provided in an earlier message, but I unfortunately did not see anything that is not clean. I still tend to assume that the malformedness of the message is not a necessary condition for the segfault, but that remains to be seen. No abort happened (yet) in my lab.

If the issue is easier to reproduce in v4, I suggest you go back to 4.4.0 (the current v4-stable) and we try to nail it down there. It would be good if we could find some predicate (like this and that traffic pattern) that would enable me to reproduce the problem in the lab (if the debug log does not help).

Please let me know your thoughts.

Rainer

From lists at luigirosa.com Tue Aug 25 15:27:26 2009
From: lists at luigirosa.com (Luigi Rosa)
Date: Tue, 25 Aug 2009 15:27:26 +0200
Subject: [rsyslog] rsyslog 4.4.0 make error
Message-ID: <4A93E6BE.8070704@luigirosa.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On a CentOS Linux system without javac, the make procedure stops with this error:

make[2]: Entering directory `/usr/src/rsyslog-4.4.0/tests'
CLASSPATH=..:./..:$CLASSPATH javac -d .. DiagTalker.java
/bin/sh: line 6: javac: command not found
make[2]: *** [classcheck.stamp] Error 127
make[2]: Leaving directory `/usr/src/rsyslog-4.4.0/tests'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/usr/src/rsyslog-4.4.0'
make: *** [all] Error 2

Commenting out the lines referencing DiagTalker.java in /tests/Makefile solves the problem, but I think that somewhere there should be a check for the presence of a Java compiler.

Ciao,
luigi

- --
/
+--[Luigi Rosa]--
\

When you have eliminated the impossible, whatever remains, however
improbable, must be the truth.
--Sherlock Holmes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkqT5rsACgkQ3kWu7Tfl6ZSFAQCfV/Nk45+Ge+qcrZPaMAD+u+Du
WAUAmQFO8y/bm+Xy6r2t5ySofP8+SSpx
=pDqR
-----END PGP SIGNATURE-----

From rgerhards at hq.adiscon.com Tue Aug 25 15:37:47 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Tue, 25 Aug 2009 15:37:47 +0200
Subject: [rsyslog] rsyslog 4.4.0 make error
References: <4A93E6BE.8070704@luigirosa.com>
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD81@GRFEXC.intern.adiscon.com>

pls see: http://bugzilla.adiscon.com/show_bug.cgi?id=146

Rainer

From lists at luigirosa.com Tue Aug 25 15:38:48 2009
From: lists at luigirosa.com (Luigi Rosa)
Date: Tue, 25 Aug 2009 15:38:48 +0200
Subject: [rsyslog] rsyslog 4.4.0 make error
In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD81@GRFEXC.intern.adiscon.com>
References: <4A93E6BE.8070704@luigirosa.com> <9B6E2A8877C38245BFB15CC491A11DA706FD81@GRFEXC.intern.adiscon.com>
Message-ID: <4A93E968.20102@luigirosa.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Rainer Gerhards said the following on 25/08/09 15:37:

> pls see: http://bugzilla.adiscon.com/show_bug.cgi?id=146

Ok, thanks a lot!

Ciao,
luigi

- --
/
+--[Luigi Rosa]--
\

I've seen things you people wouldn't believe. Attack ships on fire off
the shoulder of Orion. I watched C-beams glitter in the dark near the
Tannhauser gate. All those moments will be lost in time, like tears in
rain. Time to die.
--Roy Batty, "Blade Runner"
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkqT6WgACgkQ3kWu7Tfl6ZR4wQCgwy5GWbUPl/zxncftfhlx899Z
/MsAn3JJyJB/6VSdLQuF2rKEJSq9T7at
=jaZ1
-----END PGP SIGNATURE-----

From david at lang.hm Tue Aug 25 16:19:48 2009
From: david at lang.hm (david at lang.hm)
Date: Tue, 25 Aug 2009 07:19:48 -0700 (PDT)
Subject: [rsyslog] abort in 4.2.1
In-Reply-To: <1251196281.3225.10.camel@rgf11>
References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD4B@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD59@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD5A@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD5B@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD5D@GRFEXC.intern.adiscon.com> <1251196281.3225.10.camel@rgf11>
Message-ID:

On Tue, 25 Aug 2009, Rainer Gerhards wrote:

> On Mon, 2009-08-24 at 14:06 -0700, david at lang.hm wrote:
>>> I'm testing to see if it has the problem I reported with 4.2.1 where it dies
>>> under load from malformed messages.
>>
>> It finally died just like 4.2.1 did. It took a _lot_ longer, which may
>> just be because the window for the race condition that causes the crash
>> is smaller: 5.x is _significantly_ more efficient than 4.x. Processing
>> ~1800 messages/sec, writing them locally and relaying them to another
>> machine, eats up <2% cpu according to top.
>>
>> I restarted it in debug mode (this takes more cpu, almost 10% of a cpu).
>
> The bad thing about debug mode is that not only is it slower, it also
> introduces some synchronization. So race bugs frequently disappear when
> debug mode is turned on. Anyhow, sometimes they persist, and then the
> debug log often provides good information (aka "definitely worth a
> try" ;)).
>
> I did some basic testing with the malformed message you provided in an
> earlier message, but I unfortunately did not see anything that is not
> clean. I still tend to assume that the malformedness of the message is
> not a necessary condition for the segfault, but that remains to be seen.
> No abort happened (yet) in my lab.

I did finally get it to die; as soon as I get into the office I'll look at the end of the debug log.

The box I am duplicating this problem on relays all the logs it receives up to another central box. The logs that come through this box are about a tenth of the total logs that the central box gets, and that central box has had no problems.

The things that I see as being different are:

1. The central box doesn't see the malformed messages (one of the relay boxes would fix that before forwarding it).

2. There are fewer systems sending simultaneously to the central box (there are ~100 boxes sending to the relay that dies, but only a half dozen relay boxes sending to the central box).

Two of the other relays handle a _far_ higher rate of logs, but from fewer sources (one has one source that spews ~15G of logs/day, the other receives ~100m logs/day from 6 machines). A third relay has more machines sending it logs, but at a lower rate than those two (but still significantly higher than the one that fails). If there was a problem with load or the number of messages being received simultaneously, I would expect one of these other three to have more problems than the one that fails on me.

3. A noticeable fraction of the logs sent through this relay box are sent by a cron job running on each of ~60 machines that wakes up every minute and scrapes a local file, sending all the pending messages, so the incoming messages are a bit burstier than normal. The relaying is still bursty, but it is only one bursty box, not many.

Note that even if this cron job is stopped I still had 4.2.1 die on this relay box, so I don't think that it's the bursty nature of the traffic.

This is why I'm suspicious of the malformed message handling.

David Lang

From rgerhards at hq.adiscon.com Tue Aug 25 16:44:26 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Tue, 25 Aug 2009 16:44:26 +0200
Subject: [rsyslog] abort in 4.2.1
Message-ID: <000101ca2592$882c9d41$100013ac@intern.adiscon.com>

Ok, that is good info. I'll still stand by for the debug log, but if that doesn't show anything I'll probably look into crafting some small tools to create a similar environment. Do the malformed messages themselves come in bursts (potentially without well-formed ones in between)?

rainer

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

From david at lang.hm Tue Aug 25 17:16:11 2009
From: david at lang.hm (david at lang.hm)
Date: Tue, 25 Aug 2009 08:16:11 -0700 (PDT)
Subject: [rsyslog] abort in 4.2.1
In-Reply-To: <000101ca2592$882c9d41$100013ac@intern.adiscon.com>
References: <000101ca2592$882c9d41$100013ac@intern.adiscon.com>
Message-ID:

On Tue, 25 Aug 2009, Rainer Gerhards wrote:

> Ok, that is good info. I'll still stand by for the debug log, but if that
> doesn't show anything I'll probably look into crafting some small tools
> to create a similar environment. Do the malformed messages themselves come
> in bursts (potentially without well-formed ones in between)?
The ones from the cron job definitely come in bursts, but even after I had them modify that script to make those messages well-formed I still had it die (at the moment I have had them revert that script to assist in this debugging).

Here is the tail of the debug log (with the messages themselves lightly sanitized).

Note that the debug log was _very_ large:

-rw-r--r-- 1 root root 2010546482 Aug 24 21:32 rsyslog.debug

Like the prior debugs, this dies on one of the malformed messages:

9570.652786352:418d6950: msg parser: flags 30, from '192.168.242.15', msg '<5>iaalog[143336]: AIB|AAAAA|2009/08/24 17:12:48|mfa challenge|XXXXXXXXX|XXX.XX.XX.XXX|Challenge Question(s)|Challenge Presented|None|N/A|N/A|N/A'
9570.652794351:418d6950: Message has legacy syslog format.
9570.652803191:418d6950: Called action, logging to builtin-file
9570.652811270:418d6950: XXXX: ENTER tryDoAction elt 0 state 0
9570.652820109:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 0xc87970, state 0
9570.652828309:418d6950: Action 0xc4e130 transitioned to state: itx
9570.652836228:418d6950: entering actionCalldoAction(), state: itx
9570.652845667:418d6950: file to log to: /var/log/messages
9570.652854067:418d6950: doWrite, pData->pStrm 0xc4f150, lenBuf 174
9570.652862546:418d6950: strm 0xc4f150: file 6 flush, buflen 174
9570.652875305:418d6950: strm 0xc4f150: file 6 write wrote 174 bytes
9570.652885664:418d6950: Action 0xc4e130 transitioned to state: rdy
9570.652893624:418d6950: action call returned 0
9570.652901623:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0
9570.652909382:418d6950: XXXX: submitBatch got state 0
9570.652917182:418d6950: XXXX: submitBatch got state 0
9570.652924941:418d6950: XXXX: submitBatch pre while state 0
9570.652932941:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0
9570.652941060:418d6950: XXXX: qAddDirect returns 0
9570.652948899:418d6950: XXXX: queueEnqObj returns 0
9570.652956699:418d6950: XXXX: queueEnqObj returned 0
9570.652964498:418d6950: XXXX: processMsgDoActions returns 0
9570.652972338:418d6950: XXXX: rule.processMsg returns 0
9570.652980017:418d6950: XXXX: pcoessMsgDoRules returns 0
9570.652988096:418d6950: Called action, logging to builtin-fwd
9570.652996056:418d6950: XXXX: ENTER tryDoAction elt 0 state 0
9570.653004895:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 0xc87970, state 0
9570.653013055:418d6950: Action 0xc4e680 transitioned to state: itx
9570.653021014:418d6950: entering actionCalldoAction(), state: itx
9570.653030533:418d6950: 192.168.210.8:514/udp
9570.653045972:418d6950: Action 0xc4e680 transitioned to state: rdy
9570.653054811:418d6950: action call returned 0
9570.653063051:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0
9570.653071050:418d6950: XXXX: submitBatch got state 0
9570.653079010:418d6950: XXXX: submitBatch got state 0
9570.653087009:418d6950: XXXX: submitBatch pre while state 0
9570.653095888:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0
9570.653104368:418d6950: XXXX: qAddDirect returns 0
9570.653112367:418d6950: XXXX: queueEnqObj returns 0
9570.653120446:418d6950: XXXX: queueEnqObj returned 0
9570.653128446:418d6950: XXXX: processMsgDoActions returns 0
9570.653136525:418d6950: XXXX: rule.processMsg returns 0
9570.653144445:418d6950: XXXX: pcoessMsgDoRules returns 0
9570.653152484:418d6950: XXXX: processMsg got return state 0
9570.653160723:418d6950: msgConsumer processes msg 28/32
9570.653168803:418d6950: dropped NUL at very end of message
9570.653352789:430d9950: recv(4,76)/192.168.242.15,acl:1,msg:<5>iaalog[143336]: AIB|AAAA|2009/08/24 17:17:07|account summary|XXXXXXXXX
9570.653367348:430d9950: main Q: entry added, size now log 186, phys 218 entries
9570.653386266:430d9950: XXXX: queueEnqObj returns 0
9570.653394706:430d9950: main Q: EnqueueMsg advised worker start
9570.653407625:430d9950: Listening on UDP syslogd socket 4 (IPv4/port 514).
9570.653416024:430d9950: --------imUDP calling select, active file descriptors (max 4): 4

From rgerhards at hq.adiscon.com Tue Aug 25 17:55:10 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Tue, 25 Aug 2009 17:55:10 +0200
Subject: [rsyslog] abort in 4.2.1
Message-ID: <000201ca259c$6def2c1e$100013ac@intern.adiscon.com>

Mmhhh... Unfortunately, this does not show anything immediately obvious. Could you provide me with a gdb backtrace of the abort? Knowing where it aborted often helps...

rainer

From david at lang.hm Tue Aug 25 17:58:51 2009
From: david at lang.hm (david at lang.hm)
Date: Tue, 25 Aug 2009 08:58:51 -0700 (PDT)
Subject: [rsyslog] abort in 4.2.1
In-Reply-To: <000201ca259c$6def2c1e$100013ac@intern.adiscon.com>
References: <000201ca259c$6def2c1e$100013ac@intern.adiscon.com>
Message-ID:

On Tue, 25 Aug 2009, Rainer Gerhards wrote:

> Mmhhh... Unfortunately, this does not show anything immediately obvious.
> Could you provide me with a gdb backtrace of the abort? Knowing where it
> aborted often helps...

I don't know how to do this.
David Lang > rainer > > ----- Urspr?ngliche Nachricht ----- > Von: "david at lang.hm" > An: "rsyslog-users" > Gesendet: 25.08.09 17:16 > Betreff: Re: [rsyslog] abort in 4.2.1 > > On Tue, 25 Aug 2009, Rainer Gerhards wrote: > >> Date: Tue, 25 Aug 2009 16:44:26 +0200 >> From: Rainer Gerhards >> Reply-To: rsyslog-users >> To: rsyslog-users >> Subject: Re: [rsyslog] abort in 4.2.1 >> >> Ok that is good info. I'll still standby for the debug log, but if that >> doesn't show anything I'll probably look into crafting some small tools >> to create a similiar environment. Do the malformed messages theselv come >> in in burts (potentially without wellformed in between)? > > the ones from the cron job definantly come in bursts, but even after I had > them modify that script to make those messages well-formed I still had it > die (at the moment I had them revert that script to assist in this > debugging > > here is the tail of the debug log (with the messages themselves lightly > sanitized) > > note that the debug log was _very_ large > > -rw-r--r-- 1 root root 2010546482 Aug 24 21:32 rsyslog.debug > > like the prior debugs, this dies on one of the malformed messages > > 9570.652786352:418d6950: msg parser: flags 30, from '192.168.242.15', msg '<5>iaalog[143336]: AIB|AAAAA|2009/08/24 17:12:48|mfa challenge|XXXXXXXXX|XXX.XX.XX.XXX|Challenge Question(s)|Challenge Presented|None|N/A|N/A|N/A' > 9570.652794351:418d6950: Message has legacy syslog format. 
> 9570.652803191:418d6950: Called action, logging to builtin-file
> 9570.652811270:418d6950: XXXX: ENTER tryDoAction elt 0 state 0
> 9570.652820109:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 0xc87970, state 0
> 9570.652828309:418d6950: Action 0xc4e130 transitioned to state: itx
> 9570.652836228:418d6950: entering actionCalldoAction(), state: itx
> 9570.652845667:418d6950: file to log to: /var/log/messages
> 9570.652854067:418d6950: doWrite, pData->pStrm 0xc4f150, lenBuf 174
> 9570.652862546:418d6950: strm 0xc4f150: file 6 flush, buflen 174
> 9570.652875305:418d6950: strm 0xc4f150: file 6 write wrote 174 bytes
> 9570.652885664:418d6950: Action 0xc4e130 transitioned to state: rdy
> 9570.652893624:418d6950: action call returned 0
> 9570.652901623:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0
> 9570.652909382:418d6950: XXXX: submitBatch got state 0
> 9570.652917182:418d6950: XXXX: submitBatch got state 0
> 9570.652924941:418d6950: XXXX: submitBatch pre while state 0
> 9570.652932941:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0
> 9570.652941060:418d6950: XXXX: qAddDirect returns 0
> 9570.652948899:418d6950: XXXX: queueEnqObj returns 0
> 9570.652956699:418d6950: XXXX: queueEnqObj returned 0
> 9570.652964498:418d6950: XXXX: processMsgDoActions returns 0
> 9570.652972338:418d6950: XXXX: rule.processMsg returns 0
> 9570.652980017:418d6950: XXXX: pcoessMsgDoRules returns 0
> 9570.652988096:418d6950: Called action, logging to builtin-fwd
> 9570.652996056:418d6950: XXXX: ENTER tryDoAction elt 0 state 0
> 9570.653004895:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 0xc87970, state 0
> 9570.653013055:418d6950: Action 0xc4e680 transitioned to state: itx
> 9570.653021014:418d6950: entering actionCalldoAction(), state: itx
> 9570.653030533:418d6950: 192.168.210.8:514/udp
> 9570.653045972:418d6950: Action 0xc4e680 transitioned to state: rdy
> 9570.653054811:418d6950: action call returned 0
> 9570.653063051:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0
> 9570.653071050:418d6950: XXXX: submitBatch got state 0
> 9570.653079010:418d6950: XXXX: submitBatch got state 0
> 9570.653087009:418d6950: XXXX: submitBatch pre while state 0
> 9570.653095888:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0
> 9570.653104368:418d6950: XXXX: qAddDirect returns 0
> 9570.653112367:418d6950: XXXX: queueEnqObj returns 0
> 9570.653120446:418d6950: XXXX: queueEnqObj returned 0
> 9570.653128446:418d6950: XXXX: processMsgDoActions returns 0
> 9570.653136525:418d6950: XXXX: rule.processMsg returns 0
> 9570.653144445:418d6950: XXXX: pcoessMsgDoRules returns 0
> 9570.653152484:418d6950: XXXX: processMsg got return state 0
> 9570.653160723:418d6950: msgConsumer processes msg 28/32
> 9570.653168803:418d6950: dropped NUL at very end of message
> 9570.653352789:430d9950:
> recv(4,76)/192.168.242.15,acl:1,msg:<5>iaalog[143336]: AIB|AAAA|2009/08/24 17:17:07|account summary|XXXXXXXXX
>
> 9570.653367348:430d9950: main Q: entry added, size now log 186, phys 218 entries
> 9570.653386266:430d9950: XXXX: queueEnqObj returns 0
> 9570.653394706:430d9950: main Q: EnqueueMsg advised worker start
> 9570.653407625:430d9950: Listening on UDP syslogd socket 4 (IPv4/port 514).
> 9570.653416024:430d9950: --------imUDP calling select, active file descriptors (max 4): 4

From david at lang.hm Tue Aug 25 18:19:10 2009
From: david at lang.hm (david at lang.hm)
Date: Tue, 25 Aug 2009 09:19:10 -0700 (PDT)
Subject: [rsyslog] abort in 4.2.1
In-Reply-To:
References: <9B6E2A8877C38245BFB15CC491A11DA706FD40@GRFEXC.intern.adiscon.com>
 <9B6E2A8877C38245BFB15CC491A11DA706FD4B@GRFEXC.intern.adiscon.com>
 <9B6E2A8877C38245BFB15CC491A11DA706FD59@GRFEXC.intern.adiscon.com>
 <9B6E2A8877C38245BFB15CC491A11DA706FD5A@GRFEXC.intern.adiscon.com>
 <9B6E2A8877C38245BFB15CC491A11DA706FD5B@GRFEXC.intern.adiscon.com>
 <9B6E2A8877C38245BFB15CC491A11DA706FD5D@GRFEXC.intern.adiscon.com>
 <1251196281.3225.10.camel@rgf11>
Message-ID:

On Tue, 25 Aug 2009, david at lang.hm wrote:

> 2. there are fewer systems sending simultaneously to the central box
> (there are ~100 boxes sending to the relay that dies, but only a half
> dozen relay boxes sending to the central box)
>
> two of the other relays handle a _far_ higher rate of logs, but from fewer
> sources (one has one source that spews ~15G of logs/day, the other
> receives ~100m logs/day from 6 machines). a third relay has more machines
> sending it logs, but at a lower rate than those two (but still
> significantly higher than the one that fails). if there was a problem with
> load or the number of messages being received simultaneously I would
> expect one of these other three to have more problems than the one that
> fails on me.

as an idea of the message rates of the various boxes (they are closer than
I thought)

the one that dies (~130 servers sending logs, including the 60 sending
bursts from the cron job)
23962669 log messages
busiest seconds
   1798 Aug 24 13:11:08
   1801 Aug 24 15:20:12
   1831 Aug 24 21:15:01
   1890 Aug 24 00:45:01
   1921 Aug 24 17:26:22
   1946 Aug 24 14:02:31
   1983 Aug 24 21:29:59
   2141 Aug 24 17:27:23
   2142 Aug 24 16:15:01
   2163 Aug 24 15:43:34
   2433 Aug 24 13:00:01
busiest min total logs
  32562 Aug 24 14:19
  32579 Aug 24 17:07
  32633 Aug 24 14:05
  32637 Aug 24 13:09
  32709 Aug 24 15:03
  33317 Aug 24 14:09
  33381 Aug 24 16:05
  33466 Aug 24 14:07
  33653 Aug 24 13:15
  33883 Aug 24 14:08
  34029 Aug 24 14:06

another one (61 servers)
31338876 log messages
   1574 Aug 24 19:57:01
   1578 Aug 24 22:07:47
   1580 Aug 24 22:56:13
   1583 Aug 24 19:56:44
   1587 Aug 24 22:56:14
   1587 Aug 24 22:07:22
   1602 Aug 24 22:07:33
   1613 Aug 24 19:56:50
   1620 Aug 24 19:15:00
   1732 Aug 24 22:55:07
   1907 Aug 24 18:00:09

  76928 Aug 24 06:57
  77623 Aug 24 16:20
  78251 Aug 24 19:15
  78770 Aug 24 22:55
  79803 Aug 24 22:07
  81383 Aug 24 22:08
  82841 Aug 24 07:42
  85746 Aug 24 17:59
  85870 Aug 24 16:22
  86423 Aug 24 16:21
  89161 Aug 24 01:09

another (13 servers)
24184377 log messages
   1131 Aug 24 15:28:44
   1150 Aug 23 22:00:02
   1163 Aug 23 17:00:01
   1165 Aug 24 15:28:45
   1165 Aug 24 16:00:01
   1231 Aug 24 12:00:02
   1247 Aug 24 11:00:02
   1298 Aug 24 08:00:02
   1327 Aug 24 14:00:02
   1330 Aug 24 05:00:02
   1340 Aug 24 09:00:02

  31931 Aug 24 13:48
  31975 Aug 24 09:32
  32090 Aug 24 10:34
  32867 Aug 24 09:30
  33246 Aug 24 10:52
  33289 Aug 24 11:56
  34161 Aug 24 13:47
  34504 Aug 24 13:46
  34850 Aug 24 13:31
  35002 Aug 24 15:28
  35287 Aug 24 13:45

another (7 servers)
29764771 log messages
   1195 Aug 24 15:00:03
   1214 Aug 24 15:28:44
   1253 Aug 24 15:28:45
   1261 Aug 24 16:00:01
   1269 Aug 24 13:13:53
   1329 Aug 24 12:00:02
   1355 Aug 24 08:00:02
   1358 Aug 24 05:00:02
   1404 Aug 24 11:00:02
   1410 Aug 24 14:00:02
   1491 Aug 24 09:00:02

  38272 Aug 24 13:47
  38797 Aug 24 09:32
  39211 Aug 24 11:56
  39624 Aug 24 13:46
  39666 Aug 24 10:34
  39750 Aug 24 09:30
  39802 Aug 24 13:13
  39814 Aug 24 13:45
  40817 Aug 24 10:52
  42071 Aug 24 15:28
  43968 Aug 24 13:31

another (115 servers)
124469193 log messages
   3289 Aug 24 14:05:01
   3312 Aug 24 15:44:44
   3319 Aug 24 15:01:13
   3319 Aug 24 15:05:16
   3320 Aug 24 15:14:44
   3331 Aug 24 14:15:34
   3350 Aug 24 15:17:42
   3422 Aug 24 15:54:44
   3542 Aug 24 15:00:01
   4075 Aug 24 16:38:16
   4078 Aug 24 15:05:15

 164209 Aug 24 15:13
 164247 Aug 24 16:04
 164274 Aug 24 14:19
 164369 Aug 24 14:37
 164581 Aug 24 15:24
 164929 Aug 24 15:12
 165015 Aug 24 15:35
 165385 Aug 24 14:17
 165446 Aug 24 15:34
 165566 Aug 24 15:00
 166864 Aug 24 15:04

central system (~10 servers sending direct)
251236252 log messages
   7208 Aug 24 07:45:01
   7318 Aug 24 08:00:02
   7414 Aug 24 07:00:01
   7427 Aug 24 09:00:01
   7452 Aug 24 13:45:01
   7723 Aug 24 10:00:01
   7838 Aug 24 11:00:01
   7858 Aug 24 13:00:01
   7970 Aug 24 08:00:01
   8155 Aug 24 16:00:01

 289323 Aug 24 12:18
 289376 Aug 24 12:56
 290064 Aug 24 09:15
 290077 Aug 24 09:27
 291036 Aug 24 08:00
 291212 Aug 24 09:38
 299116 Aug 24 10:37
 300814 Aug 24 09:07
 301968 Aug 24 12:57
 304175 Aug 24 09:14
 306308 Aug 24 07:24

From rgerhards at hq.adiscon.com Tue Aug 25 18:29:57 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Tue, 25 Aug 2009 18:29:57 +0200 Subject: [rsyslog] abort in 4.2.1 Message-ID: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com> First shot at it: 1. Make sure core dump is written (ulimit -c 999999999) 2. Have it abort 3. bdb /path/to/binary/tsyslogd 4. Core name-of-corefile (usually /core.SOMENBR) 5. Enter: bt (for backtrace) 6. Enter: info thread (displays threads) 7. For each thread: 7a. Thread number 7b. Bt 8. You are done (ctl-d) Step 7 is necessary because the default bt does not necessarily point to the abort thread (some times it does, some times not...) rainer ----- Urspr?ngliche Nachricht ----- Von: "david at lang.hm" An: "rsyslog-users" Gesendet: 25.08.09 17:59 Betreff: Re: [rsyslog] abort in 4.2.1 On Tue, 25 Aug 2009, Rainer Gerhards wrote: > Mmhhh... Unfortunately, this does not show anything immediately obvious. > Could you provide me with a gdb backtrace of the abort? Knowing where it > aborted often helps... I don't know how to do this. David Lang > rainer > > ----- Urspr?ngliche Nachricht ----- > Von: "david at lang.hm" > An: "rsyslog-users" > Gesendet: 25.08.09 17:16 > Betreff: Re: [rsyslog] abort in 4.2.1 > > On Tue, 25 Aug 2009, Rainer Gerhards wrote: > >> Date: Tue, 25 Aug 2009 16:44:26 +0200 >> From: Rainer Gerhards >> Reply-To: rsyslog-users >> To: rsyslog-users >> Subject: Re: [rsyslog] abort in 4.2.1 >> >> Ok that is good info. I'll still standby for the debug log, but if that >> doesn't show anything I'll probably look into crafting some small tools >> to create a similiar environment. Do the malformed messages theselv come >> in in burts (potentially without wellformed in between)? 
> > the ones from the cron job definantly come in bursts, but even after I had > them modify that script to make those messages well-formed I still had it > die (at the moment I had them revert that script to assist in this > debugging > > here is the tail of the debug log (with the messages themselves lightly > sanitized) > > note that the debug log was _very_ large > > -rw-r--r-- 1 root root 2010546482 Aug 24 21:32 rsyslog.debug > > like the prior debugs, this dies on one of the malformed messages > > 9570.652786352:418d6950: msg parser: flags 30, from '192.168.242.15', msg '<5>iaalog[143336]: AIB|AAAAA|2009/08/24 17:12:48|mfa challenge|XXXXXXXXX|XXX.XX.XX.XXX|Challenge Question(s)|Challenge Presented|None|N/A|N/A|N/A' > 9570.652794351:418d6950: Message has legacy syslog format. > 9570.652803191:418d6950: Called action, logging to builtin-file > 9570.652811270:418d6950: XXXX: ENTER tryDoAction elt 0 state 0 > 9570.652820109:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 0xc87970, state 0 > 9570.652828309:418d6950: Action 0xc4e130 transitioned to state: itx > 9570.652836228:418d6950: entering actionCalldoAction(), state: itx > 9570.652845667:418d6950: file to log to: /var/log/messages > 9570.652854067:418d6950: doWrite, pData->pStrm 0xc4f150, lenBuf 174 > 9570.652862546:418d6950: strm 0xc4f150: file 6 flush, buflen 174 > 9570.652875305:418d6950: strm 0xc4f150: file 6 write wrote 174 bytes > 9570.652885664:418d6950: Action 0xc4e130 transitioned to state: rdy > 9570.652893624:418d6950: action call returned 0 > 9570.652901623:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0 > 9570.652909382:418d6950: XXXX: submitBatch got state 0 > 9570.652917182:418d6950: XXXX: submitBatch got state 0 > 9570.652924941:418d6950: XXXX: submitBatch pre while state 0 > 9570.652932941:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0 > 9570.652941060:418d6950: XXXX: qAddDirect returns 0 > 9570.652948899:418d6950: XXXX: queueEnqObj returns 0 > 
9570.652956699:418d6950: XXXX: queueEnqObj returned 0 > 9570.652964498:418d6950: XXXX: processMsgDoActions returns 0 > 9570.652972338:418d6950: XXXX: rule.processMsg returns 0 > 9570.652980017:418d6950: XXXX: pcoessMsgDoRules returns 0 > 9570.652988096:418d6950: Called action, logging to builtin-fwd > 9570.652996056:418d6950: XXXX: ENTER tryDoAction elt 0 state 0 > 9570.653004895:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 0xc87970, state 0 > 9570.653013055:418d6950: Action 0xc4e680 transitioned to state: itx > 9570.653021014:418d6950: entering actionCalldoAction(), state: itx > 9570.653030533:418d6950: 192.168.210.8:514/udp > 9570.653045972:418d6950: Action 0xc4e680 transitioned to state: rdy > 9570.653054811:418d6950: action call returned 0 > 9570.653063051:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0 > 9570.653071050:418d6950: XXXX: submitBatch got state 0 > 9570.653079010:418d6950: XXXX: submitBatch got state 0 > 9570.653087009:418d6950: XXXX: submitBatch pre while state 0 > 9570.653095888:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0 > 9570.653104368:418d6950: XXXX: qAddDirect returns 0 > 9570.653112367:418d6950: XXXX: queueEnqObj returns 0 > 9570.653120446:418d6950: XXXX: queueEnqObj returned 0 > 9570.653128446:418d6950: XXXX: processMsgDoActions returns 0 > 9570.653136525:418d6950: XXXX: rule.processMsg returns 0 > 9570.653144445:418d6950: XXXX: pcoessMsgDoRules returns 0 > 9570.653152484:418d6950: XXXX: processMsg got return state 0 > 9570.653160723:418d6950: msgConsumer processes msg 28/32 > 9570.653168803:418d6950: dropped NUL at very end of message > 9570.653352789:430d9950: > recv(4,76)/192.168.242.15,acl:1,msg:<5>iaalog[143336]: AIB|AAAA|2009/08/24 17:17:07|account summary|XXXXXXXXX > > 9570.653367348:430d9950: main Q: entry added, size now log 186, phys 218 entries > 9570.653386266:430d9950: XXXX: queueEnqObj returns 0 > 9570.653394706:430d9950: main Q: EnqueueMsg advised worker start > 9570.653407625:430d9950: 
Listening on UDP syslogd socket 4 (IPv4/port 514). > 9570.653416024:430d9950: --------imUDP calling select, active file descriptors (max 4): 4 > >> rainer >> >> ----- Urspr?ngliche Nachricht ----- >> Von: "david at lang.hm" >> An: "rsyslog-users" >> Gesendet: 25.08.09 16:20 >> Betreff: Re: [rsyslog] abort in 4.2.1 >> >> On Tue, 25 Aug 2009, Rainer Gerhards wrote: >> >>> On Mon, 2009-08-24 at 14:06 -0700, david at lang.hm wrote: >>>>> I'm testing to see if it has the problem I reported with 4.2.1 where it dies >>>>> under load from malformed messages. >>>> >>>> It finally died just like 4.2.1 did. It took a _lot_ longer (which may >>>> just be that the race condition to cause the crash is smaller, 5.x is >>>> _significantly_ more efficiant than 4.x is. processing ~1800 messages/sec, >>>> writing them locally and relaying them to another machine eats up <2% cpu >>>> according to top) >>>> >>>> I restarted it in debug mode (this takes more cpu, almost 10% of a cpu) >>> >>> The bad thing about debug mode is that not only it is slower, but it >>> introduces some synchronization. So race bugs frequently disappear when >>> debug mode is turned on. Anyhow, sometimes they persist and then the >>> debug log often provides good information (aka "definitely worth a >>> try" ;)). >>> >>> I did some basic testing with the malformed message you provided in an >>> earlier message, but I unfortunately did not see anything that is not >>> clean. I am still a bit of the assumption that the malformednes of the >>> message is not a necessary condition for the segfault - but that needs >>> to be seen. No abort happened (yet) in my lab. >> >> I did finally get it to die, as soon as I get into the office I'll look at >> the end of the debug log >> >> the box I am duplicating this problem on relays all the logs it recieves >> up to another central box. 
the logs that come through this box are about a >> tenth of the total logs that the central box gets, and that central box >> has had no problems. >> >> the things that I see as being different are >> >> 1. the central box doesn't see the malformed messages (one of the relay >> boxes would fix that before forwarding it) >> >> 2. there are fewer systems sending simultaniously to the central box >> (there are ~100 boxes sending to the relay that dies, but only a half >> dozen relay boxes sending to the central box) >> >> two of the other relays handle a _far_ higher rate of logs, but from fewer >> sources (one has one source that spews ~15G of logs/day, the other >> recieves ~100m logs/day from 6 machines). a third relay has more machines >> sending it logs, but at a lower rate than those two (but still >> significantly higher than the one that fails). if there was a problem with >> load or the number of messages being recieved simultaniously I would >> expect one of these other three to have more problems than the one that >> fails on me. >> >> 3. 
a noticable fraction of the logs sent through this relay box are sent >> by a cron job running on each of ~60 machines that wakes up every min and >> scrapes a local file, sending all the pending messages, so the incoming >> messages are a bit burstier than normal, the relaying is still bursty, but >> it is only one bursty box, not many >> >> note that even if this cron job is stopped I still had 4.2.1 die on this >> relay box, so I don't think that it's the bursty nature of the traffic >> >> this is why I'm suspicious of the malformed message handling >> >> David Lang >> _______________________________________________ >> rsyslog mailing list >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> http://www.rsyslog.com >> _______________________________________________ >> rsyslog mailing list >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> http://www.rsyslog.com > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com From rgerhards at hq.adiscon.com Tue Aug 25 18:49:58 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Tue, 25 Aug 2009 18:49:58 +0200 Subject: [rsyslog] abort in 4.2.1 Message-ID: <000501ca25a4$15498bbf$100013ac@intern.adiscon.com> Args.. Step 3 is "gdb" not "bdb"... ----- Urspr?ngliche Nachricht ----- Von: "Rainer Gerhards" An: "rsyslog-users" Gesendet: 25.08.09 18:30 Betreff: Re: [rsyslog] abort in 4.2.1 First shot at it: 1. Make sure core dump is written (ulimit -c 999999999) 2. Have it abort 3. bdb /path/to/binary/tsyslogd 4. Core name-of-corefile (usually /core.SOMENBR) 5. Enter: bt (for backtrace) 6. Enter: info thread (displays threads) 7. 
For each thread: 7a. Thread number 7b. Bt 8. You are done (ctl-d) Step 7 is necessary because the default bt does not necessarily point to the abort thread (some times it does, some times not...) rainer ----- Urspr?ngliche Nachricht ----- Von: "david at lang.hm" An: "rsyslog-users" Gesendet: 25.08.09 17:59 Betreff: Re: [rsyslog] abort in 4.2.1 On Tue, 25 Aug 2009, Rainer Gerhards wrote: > Mmhhh... Unfortunately, this does not show anything immediately obvious. > Could you provide me with a gdb backtrace of the abort? Knowing where it > aborted often helps... I don't know how to do this. David Lang > rainer > > ----- Urspr?ngliche Nachricht ----- > Von: "david at lang.hm" > An: "rsyslog-users" > Gesendet: 25.08.09 17:16 > Betreff: Re: [rsyslog] abort in 4.2.1 > > On Tue, 25 Aug 2009, Rainer Gerhards wrote: > >> Date: Tue, 25 Aug 2009 16:44:26 +0200 >> From: Rainer Gerhards >> Reply-To: rsyslog-users >> To: rsyslog-users >> Subject: Re: [rsyslog] abort in 4.2.1 >> >> Ok that is good info. I'll still standby for the debug log, but if that >> doesn't show anything I'll probably look into crafting some small tools >> to create a similiar environment. Do the malformed messages theselv come >> in in burts (potentially without wellformed in between)? 
> > the ones from the cron job definantly come in bursts, but even after I had > them modify that script to make those messages well-formed I still had it > die (at the moment I had them revert that script to assist in this > debugging > > here is the tail of the debug log (with the messages themselves lightly > sanitized) > > note that the debug log was _very_ large > > -rw-r--r-- 1 root root 2010546482 Aug 24 21:32 rsyslog.debug > > like the prior debugs, this dies on one of the malformed messages > > 9570.652786352:418d6950: msg parser: flags 30, from '192.168.242.15', msg '<5>iaalog[143336]: AIB|AAAAA|2009/08/24 17:12:48|mfa challenge|XXXXXXXXX|XXX.XX.XX.XXX|Challenge Question(s)|Challenge Presented|None|N/A|N/A|N/A' > 9570.652794351:418d6950: Message has legacy syslog format. > 9570.652803191:418d6950: Called action, logging to builtin-file > 9570.652811270:418d6950: XXXX: ENTER tryDoAction elt 0 state 0 > 9570.652820109:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 0xc87970, state 0 > 9570.652828309:418d6950: Action 0xc4e130 transitioned to state: itx > 9570.652836228:418d6950: entering actionCalldoAction(), state: itx > 9570.652845667:418d6950: file to log to: /var/log/messages > 9570.652854067:418d6950: doWrite, pData->pStrm 0xc4f150, lenBuf 174 > 9570.652862546:418d6950: strm 0xc4f150: file 6 flush, buflen 174 > 9570.652875305:418d6950: strm 0xc4f150: file 6 write wrote 174 bytes > 9570.652885664:418d6950: Action 0xc4e130 transitioned to state: rdy > 9570.652893624:418d6950: action call returned 0 > 9570.652901623:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0 > 9570.652909382:418d6950: XXXX: submitBatch got state 0 > 9570.652917182:418d6950: XXXX: submitBatch got state 0 > 9570.652924941:418d6950: XXXX: submitBatch pre while state 0 > 9570.652932941:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0 > 9570.652941060:418d6950: XXXX: qAddDirect returns 0 > 9570.652948899:418d6950: XXXX: queueEnqObj returns 0 > 
9570.652956699:418d6950: XXXX: queueEnqObj returned 0 > 9570.652964498:418d6950: XXXX: processMsgDoActions returns 0 > 9570.652972338:418d6950: XXXX: rule.processMsg returns 0 > 9570.652980017:418d6950: XXXX: pcoessMsgDoRules returns 0 > 9570.652988096:418d6950: Called action, logging to builtin-fwd > 9570.652996056:418d6950: XXXX: ENTER tryDoAction elt 0 state 0 > 9570.653004895:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 0xc87970, state 0 > 9570.653013055:418d6950: Action 0xc4e680 transitioned to state: itx > 9570.653021014:418d6950: entering actionCalldoAction(), state: itx > 9570.653030533:418d6950: 192.168.210.8:514/udp > 9570.653045972:418d6950: Action 0xc4e680 transitioned to state: rdy > 9570.653054811:418d6950: action call returned 0 > 9570.653063051:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0 > 9570.653071050:418d6950: XXXX: submitBatch got state 0 > 9570.653079010:418d6950: XXXX: submitBatch got state 0 > 9570.653087009:418d6950: XXXX: submitBatch pre while state 0 > 9570.653095888:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0 > 9570.653104368:418d6950: XXXX: qAddDirect returns 0 > 9570.653112367:418d6950: XXXX: queueEnqObj returns 0 > 9570.653120446:418d6950: XXXX: queueEnqObj returned 0 > 9570.653128446:418d6950: XXXX: processMsgDoActions returns 0 > 9570.653136525:418d6950: XXXX: rule.processMsg returns 0 > 9570.653144445:418d6950: XXXX: pcoessMsgDoRules returns 0 > 9570.653152484:418d6950: XXXX: processMsg got return state 0 > 9570.653160723:418d6950: msgConsumer processes msg 28/32 > 9570.653168803:418d6950: dropped NUL at very end of message > 9570.653352789:430d9950: > recv(4,76)/192.168.242.15,acl:1,msg:<5>iaalog[143336]: AIB|AAAA|2009/08/24 17:17:07|account summary|XXXXXXXXX > > 9570.653367348:430d9950: main Q: entry added, size now log 186, phys 218 entries > 9570.653386266:430d9950: XXXX: queueEnqObj returns 0 > 9570.653394706:430d9950: main Q: EnqueueMsg advised worker start > 9570.653407625:430d9950: 
Listening on UDP syslogd socket 4 (IPv4/port 514). > 9570.653416024:430d9950: --------imUDP calling select, active file descriptors (max 4): 4 > >> rainer >> >> ----- Urspr?ngliche Nachricht ----- >> Von: "david at lang.hm" >> An: "rsyslog-users" >> Gesendet: 25.08.09 16:20 >> Betreff: Re: [rsyslog] abort in 4.2.1 >> >> On Tue, 25 Aug 2009, Rainer Gerhards wrote: >> >>> On Mon, 2009-08-24 at 14:06 -0700, david at lang.hm wrote: >>>>> I'm testing to see if it has the problem I reported with 4.2.1 where it dies >>>>> under load from malformed messages. >>>> >>>> It finally died just like 4.2.1 did. It took a _lot_ longer (which may >>>> just be that the race condition to cause the crash is smaller, 5.x is >>>> _significantly_ more efficiant than 4.x is. processing ~1800 messages/sec, >>>> writing them locally and relaying them to another machine eats up <2% cpu >>>> according to top) >>>> >>>> I restarted it in debug mode (this takes more cpu, almost 10% of a cpu) >>> >>> The bad thing about debug mode is that not only it is slower, but it >>> introduces some synchronization. So race bugs frequently disappear when >>> debug mode is turned on. Anyhow, sometimes they persist and then the >>> debug log often provides good information (aka "definitely worth a >>> try" ;)). >>> >>> I did some basic testing with the malformed message you provided in an >>> earlier message, but I unfortunately did not see anything that is not >>> clean. I am still a bit of the assumption that the malformednes of the >>> message is not a necessary condition for the segfault - but that needs >>> to be seen. No abort happened (yet) in my lab. >> >> I did finally get it to die, as soon as I get into the office I'll look at >> the end of the debug log >> >> the box I am duplicating this problem on relays all the logs it recieves >> up to another central box. 
the logs that come through this box are about a >> tenth of the total logs that the central box gets, and that central box >> has had no problems. >> >> the things that I see as being different are >> >> 1. the central box doesn't see the malformed messages (one of the relay >> boxes would fix that before forwarding it) >> >> 2. there are fewer systems sending simultaniously to the central box >> (there are ~100 boxes sending to the relay that dies, but only a half >> dozen relay boxes sending to the central box) >> >> two of the other relays handle a _far_ higher rate of logs, but from fewer >> sources (one has one source that spews ~15G of logs/day, the other >> recieves ~100m logs/day from 6 machines). a third relay has more machines >> sending it logs, but at a lower rate than those two (but still >> significantly higher than the one that fails). if there was a problem with >> load or the number of messages being recieved simultaniously I would >> expect one of these other three to have more problems than the one that >> fails on me. >> >> 3. 
a noticeable fraction of the logs sent through this relay box are sent
>> by a cron job running on each of ~60 machines that wakes up every minute and
>> scrapes a local file, sending all the pending messages, so the incoming
>> messages are a bit burstier than normal. the relaying is still bursty, but
>> it is only one bursty box, not many
>>
>> note that even if this cron job is stopped I still had 4.2.1 die on this
>> relay box, so I don't think that it's the bursty nature of the traffic
>>
>> this is why I'm suspicious of the malformed message handling
>>
>> David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

From david at lang.hm Tue Aug 25 23:56:31 2009
From: david at lang.hm (david at lang.hm)
Date: Tue, 25 Aug 2009 14:56:31 -0700 (PDT)
Subject: [rsyslog] abort in 4.2.1
In-Reply-To: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com>
References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com>
Message-ID:

On Tue, 25 Aug 2009, Rainer Gerhards wrote:

> First shot at it:
>
> 1. Make sure core dump is written (ulimit -c 999999999)
> 2. Have it abort
> 3. gdb /path/to/binary/rsyslogd
> 4. core name-of-corefile (usually /core.SOMENBR)
> 5. Enter: bt (for backtrace)
> 6. Enter: info threads (displays threads)
> 7. For each thread:
> 7a. thread NUMBER
> 7b. bt
> 8. You are done (Ctrl-D)

Core was generated by `rsyslogd -c5 -x'.
Program terminated with signal 11, Segmentation fault.
[New process 11534]
[New process 11538]
[New process 11535]
[New process 11537]
[New process 11533]
[New process 11536]
#0  sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222
222             if(pszMsg[iSrc] == '\0') { /* guard against \0 characters... */
(gdb) bt
#0  sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222
#1  0x00000000004161f1 in parseMsg (pMsg=0x7f312c001530) at parser.c:260
#2  0x000000000040b6fc in msgConsumer (notNeeded=, pBatch=0xa2af98) at syslogd.c:942
#3  0x000000000042df9e in ConsumerReg (pThis=0xa30b00, pWti=0xa2af70) at queue.c:1818
#4  0x0000000000428220 in wtiWorker (pThis=0xa2af70) at wti.c:276
#5  0x00000000004279ac in wtpWorker (arg=0xa2af70) at wtp.c:349
#6  0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0
#7  0x00007f313de545ad in clone () from /lib/libc.so.6
#8  0x0000000000000000 in ?? ()
(gdb) info threads
  6 process 11536  0x00007f313de4dce2 in select () from /lib/libc.so.6
  5 process 11533  0x00007f313de4dce2 in select () from /lib/libc.so.6
  4 process 11537  0x00007f313e4f27db in read () from /lib/libpthread.so.0
  3 process 11535  0x00007f313de4dce2 in select () from /lib/libc.so.6
  2 process 11538  0x00007f313de4dce2 in select () from /lib/libc.so.6
* 1 process 11534  sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222
(gdb) thread 1
[Switching to thread 1 (process 11534)]#0  sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222
222             if(pszMsg[iSrc] == '\0') { /* guard against \0 characters... */
(gdb) bt
#0  sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222
#1  0x00000000004161f1 in parseMsg (pMsg=0x7f312c001530) at parser.c:260
#2  0x000000000040b6fc in msgConsumer (notNeeded=, pBatch=0xa2af98) at syslogd.c:942
#3  0x000000000042df9e in ConsumerReg (pThis=0xa30b00, pWti=0xa2af70) at queue.c:1818
#4  0x0000000000428220 in wtiWorker (pThis=0xa2af70) at wti.c:276
#5  0x00000000004279ac in wtpWorker (arg=0xa2af70) at wtp.c:349
#6  0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0
#7  0x00007f313de545ad in clone () from /lib/libc.so.6
#8  0x0000000000000000 in ?? ()
(gdb) thread 2
[Switching to thread 2 (process 11538)]#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
(gdb) bt
#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
#1  0x00007f313d1673b3 in ?? () from /usr/local/lib/rsyslog/imudp.so
#2  0x000000000043407d in thrdStarter (arg=0x7f312c000dd0) at ../threads.c:157
#3  0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0
#4  0x00007f313de545ad in clone () from /lib/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (process 11535)]#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
(gdb) bt
#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
#1  0x0000000000433f0a in thrdSleep (pThis=0x7f312c0008c0, iSeconds=, iuSeconds=) at ../threads.c:230
#2  0x00007f313d7739a3 in ?? () from /usr/local/lib/rsyslog/immark.so
#3  0x000000000043407d in thrdStarter (arg=0x7f312c0008c0) at ../threads.c:157
#4  0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0
#5  0x00007f313de545ad in clone () from /lib/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 4
[Switching to thread 4 (process 11537)]#0  0x00007f313e4f27db in read () from /lib/libpthread.so.0
(gdb) bt
#0  0x00007f313e4f27db in read () from /lib/libpthread.so.0
#1  0x00007f313d36bdc7 in klogLogKMsg () from /usr/local/lib/rsyslog/imklog.so
#2  0x00007f313d36b29c in ?? () from /usr/local/lib/rsyslog/imklog.so
#3  0x000000000043407d in thrdStarter (arg=0x7f312c000c20) at ../threads.c:157
#4  0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0
#5  0x00007f313de545ad in clone () from /lib/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 5
[Switching to thread 5 (process 11533)]#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
(gdb) bt
#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
#1  0x000000000040d55a in mainThread () at syslogd.c:2520
#2  0x000000000040ec1d in realMain (argc=, argv=0x0) at syslogd.c:3436
#3  0x00007f313dda31a6 in __libc_start_main () from /lib/libc.so.6
#4  0x000000000040ab49 in _start ()
(gdb) thread 6
[Switching to thread 6 (process 11536)]#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
(gdb) bt
#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
#1  0x00007f313d5716f0 in ?? () from /usr/local/lib/rsyslog/imuxsock.so
#2  0x000000000043407d in thrdStarter (arg=0x7f312c000a70) at ../threads.c:157
#3  0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0
#4  0x00007f313de545ad in clone () from /lib/libc.so.6
#5  0x0000000000000000 in ?? ()

> Step 7 is necessary because the default bt does not necessarily point to the abort thread (sometimes it does, sometimes not...)
>
> rainer
>
> ----- Original Message -----
> From: "david at lang.hm"
> To: "rsyslog-users"
> Sent: 25.08.09 17:59
> Subject: Re: [rsyslog] abort in 4.2.1
>
> On Tue, 25 Aug 2009, Rainer Gerhards wrote:
>
>> Mmhhh... Unfortunately, this does not show anything immediately obvious.
>> Could you provide me with a gdb backtrace of the abort? Knowing where it
>> aborted often helps...
>
> I don't know how to do this.
>
> David Lang
>
>> rainer
>>
>> ----- Original Message -----
>> From: "david at lang.hm"
>> To: "rsyslog-users"
>> Sent: 25.08.09 17:16
>> Subject: Re: [rsyslog] abort in 4.2.1
>>
>> On Tue, 25 Aug 2009, Rainer Gerhards wrote:
>>
>>> Date: Tue, 25 Aug 2009 16:44:26 +0200
>>> From: Rainer Gerhards
>>> Reply-To: rsyslog-users
>>> To: rsyslog-users
>>> Subject: Re: [rsyslog] abort in 4.2.1
>>>
>>> Ok that is good info. I'll still stand by for the debug log, but if that
>>> doesn't show anything I'll probably look into crafting some small tools
>>> to create a similar environment. Do the malformed messages themselves come
>>> in in bursts (potentially without well-formed ones in between)?
>>
>> the ones from the cron job definitely come in bursts, but even after I had
>> them modify that script to make those messages well-formed I still had it
>> die (at the moment I had them revert that script to assist in this
>> debugging)
>>
>> here is the tail of the debug log (with the messages themselves lightly
>> sanitized)
>>
>> note that the debug log was _very_ large
>>
>> -rw-r--r-- 1 root root 2010546482 Aug 24 21:32 rsyslog.debug
>>
>> like the prior debugs, this dies on one of the malformed messages
>>
>> 9570.652786352:418d6950: msg parser: flags 30, from '192.168.242.15', msg '<5>iaalog[143336]: AIB|AAAAA|2009/08/24 17:12:48|mfa challenge|XXXXXXXXX|XXX.XX.XX.XXX|Challenge Question(s)|Challenge Presented|None|N/A|N/A|N/A'
>> 9570.652794351:418d6950: Message has legacy syslog format.
>> 9570.652803191:418d6950: Called action, logging to builtin-file
>> 9570.652811270:418d6950: XXXX: ENTER tryDoAction elt 0 state 0
>> 9570.652820109:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 0xc87970, state 0
>> 9570.652828309:418d6950: Action 0xc4e130 transitioned to state: itx
>> 9570.652836228:418d6950: entering actionCalldoAction(), state: itx
>> 9570.652845667:418d6950: file to log to: /var/log/messages
>> 9570.652854067:418d6950: doWrite, pData->pStrm 0xc4f150, lenBuf 174
>> 9570.652862546:418d6950: strm 0xc4f150: file 6 flush, buflen 174
>> 9570.652875305:418d6950: strm 0xc4f150: file 6 write wrote 174 bytes
>> 9570.652885664:418d6950: Action 0xc4e130 transitioned to state: rdy
>> 9570.652893624:418d6950: action call returned 0
>> 9570.652901623:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0
>> 9570.652909382:418d6950: XXXX: submitBatch got state 0
>> 9570.652917182:418d6950: XXXX: submitBatch got state 0
>> 9570.652924941:418d6950: XXXX: submitBatch pre while state 0
>> 9570.652932941:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0
>> 9570.652941060:418d6950: XXXX: qAddDirect returns 0
>> 9570.652948899:418d6950: XXXX: queueEnqObj returns 0
>> 9570.652956699:418d6950: XXXX: queueEnqObj returned 0
>> 9570.652964498:418d6950: XXXX: processMsgDoActions returns 0
>> 9570.652972338:418d6950: XXXX: rule.processMsg returns 0
>> 9570.652980017:418d6950: XXXX: pcoessMsgDoRules returns 0
>> 9570.652988096:418d6950: Called action, logging to builtin-fwd
>> 9570.652996056:418d6950: XXXX: ENTER tryDoAction elt 0 state 0
>> 9570.653004895:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 0xc87970, state 0
>> 9570.653013055:418d6950: Action 0xc4e680 transitioned to state: itx
>> 9570.653021014:418d6950: entering actionCalldoAction(), state: itx
>> 9570.653030533:418d6950: 192.168.210.8:514/udp
>> 9570.653045972:418d6950: Action 0xc4e680 transitioned to state: rdy
>> 9570.653054811:418d6950: action call returned 0
>> 9570.653063051:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0
>> 9570.653071050:418d6950: XXXX: submitBatch got state 0
>> 9570.653079010:418d6950: XXXX: submitBatch got state 0
>> 9570.653087009:418d6950: XXXX: submitBatch pre while state 0
>> 9570.653095888:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0
>> 9570.653104368:418d6950: XXXX: qAddDirect returns 0
>> 9570.653112367:418d6950: XXXX: queueEnqObj returns 0
>> 9570.653120446:418d6950: XXXX: queueEnqObj returned 0
>> 9570.653128446:418d6950: XXXX: processMsgDoActions returns 0
>> 9570.653136525:418d6950: XXXX: rule.processMsg returns 0
>> 9570.653144445:418d6950: XXXX: pcoessMsgDoRules returns 0
>> 9570.653152484:418d6950: XXXX: processMsg got return state 0
>> 9570.653160723:418d6950: msgConsumer processes msg 28/32
>> 9570.653168803:418d6950: dropped NUL at very end of message
>> 9570.653352789:430d9950:
>> recv(4,76)/192.168.242.15,acl:1,msg:<5>iaalog[143336]: AIB|AAAA|2009/08/24 17:17:07|account summary|XXXXXXXXX
>>
>> 9570.653367348:430d9950: main Q: entry added, size now log 186, phys 218 entries
>> 9570.653386266:430d9950: XXXX: queueEnqObj returns 0
>> 9570.653394706:430d9950: main Q: EnqueueMsg advised worker start
>> 9570.653407625:430d9950: Listening on UDP syslogd socket 4 (IPv4/port 514).
>> 9570.653416024:430d9950: --------imUDP calling select, active file descriptors (max 4): 4
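
The interactive gdb session in the message above (bt, info threads, then thread N and bt for each thread) can also be collected in one non-interactive pass. The following is only a minimal sketch: it assumes gdb is installed and the binary carries debug symbols, the function name `collect_backtraces` and the `GDB` override variable are made up for illustration, and the paths in the usage comment are placeholders. `thread apply all bt` prints a backtrace for every thread, so the aborting thread is covered no matter which thread gdb selects by default.

```shell
#!/bin/sh
# Sketch only: collect_backtraces and the GDB variable are illustrative names,
# not part of rsyslog or gdb.

# Dump the thread list and a backtrace of every thread from a core file.
#   $1 = path to the rsyslogd binary that produced the core
#   $2 = path to the core file
collect_backtraces() {
    binary=$1
    core=$2
    # -batch makes gdb exit once the commands have run;
    # each -ex supplies one gdb command, executed in order.
    "${GDB:-gdb}" -batch \
        -ex "info threads" \
        -ex "thread apply all bt" \
        "$binary" "$core"
}

# Usage (placeholder paths):
#   ulimit -c unlimited        # allow core files, in the shell that starts rsyslogd
#   collect_backtraces /path/to/rsyslogd /path/to/core > backtraces.txt
```

Redirecting the output to a file gives something that can be attached to a list message as-is.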
From rgerhards at hq.adiscon.com Wed Aug 26 08:17:52 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Wed, 26 Aug 2009 08:17:52 +0200
Subject: [rsyslog] abort in 4.2.1
Message-ID: <000601ca2614$f41ea262$100013ac@intern.adiscon.com>

Excellent! This gives me something to work with. I could well envision that there is some quirk at that location - will do code review...

rainer

From rgerhards at hq.adiscon.com Wed Aug 26 11:10:21 2009
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Wed, 26 Aug 2009 11:10:21 +0200
Subject: [rsyslog] abort in 4.2.1
References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com>
Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com>

David,

quick question: you originally reported the bug with 4.2.1. I just tried to find out the differences to this module and saw that no official 4.2.1 exists. Did you mean 4.3.1 or an interim snapshot? If you no longer know, don't try hard to find out. I can probably do well enough with the info I have, but this extra bit may be useful to verify I am looking at the right pieces.

Rainer

> -----Original Message-----
> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
> bounces at lists.adiscon.com] On Behalf Of david at lang.hm
> Sent: Tuesday, August 25, 2009 11:57 PM
> To: rsyslog-users
> Subject: Re: [rsyslog] abort in 4.2.1
>
> On Tue, 25 Aug 2009, Rainer Gerhards wrote:
>
> > First shot at it:
> >
> > 1. Make sure core dump is written (ulimit -c 999999999)
> > 2. Have it abort
> > 3. gdb /path/to/binary/rsyslogd
> > 4. core name-of-corefile (usually /core.SOMENBR)
> > 5. Enter: bt (for backtrace)
> > 6. Enter: info threads (displays threads)
> > 7. For each thread:
> > 7a. thread NUMBER
> > 7b. bt
> > 8. You are done (Ctrl-D)
>
> Core was generated by `rsyslogd -c5 -x'.
> Program terminated with signal 11, Segmentation fault.
> [New process 11534] > [New process 11538] > [New process 11535] > [New process 11537] > [New process 11533] > [New process 11536] > #0 sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222 > 222 if(pszMsg[iSrc] == '\0') { /* guard against \0 > characters... */ > (gdb) bt > #0 sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222 > #1 0x00000000004161f1 in parseMsg (pMsg=0x7f312c001530) at > parser.c:260 > #2 0x000000000040b6fc in msgConsumer (notNeeded=<value optimized out>, > pBatch=0xa2af98) at syslogd.c:942 > #3 0x000000000042df9e in ConsumerReg (pThis=0xa30b00, pWti=0xa2af70) > at queue.c:1818 > #4 0x0000000000428220 in wtiWorker (pThis=0xa2af70) at wti.c:276 > #5 0x00000000004279ac in wtpWorker (arg=0xa2af70) at wtp.c:349 > #6 0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0 > #7 0x00007f313de545ad in clone () from /lib/libc.so.6 > #8 0x0000000000000000 in ?? () From david at lang.hm Wed Aug 26 11:31:52 2009 From: david at lang.hm (david at lang.hm) Date: Wed, 26 Aug 2009 02:31:52 -0700 (PDT) Subject: [rsyslog] abort in 4.2.1 In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com> References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com> Message-ID: On Wed, 26 Aug 2009, Rainer Gerhards wrote: > David, > > quick question: you originally quoted the bug with 4.2.1. I just tried to > find out the differences to this module and saw that no official 4.2.1 > exists. Did you mean 4.3.1 or an interim snapshot? If you no longer know it, > don't try hard to find out. I can probably do well enough with the info I > have, but this extra bit may be useful to verify I am looking at the right > pieces. I thought that it was 4.2.1, it was the v4-stable branch while you were on vacation. I first tried installing 5.x on the box, but it had the name problem, so I went to the 4.x branch instead. 
I can check on another machine in the morning, but since I didn't label it with the git commit I don't know how accurate anything I will find will be. I haven't ever had rsyslog running in the location that's crashing, so there's no way of knowing how far back the bug has been there. David Lang > Rainer > >> -----Original Message----- >> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- >> bounces at lists.adiscon.com] On Behalf Of david at lang.hm >> Sent: Tuesday, August 25, 2009 11:57 PM >> To: rsyslog-users >> Subject: Re: [rsyslog] abort in 4.2.1 >> >> On Tue, 25 Aug 2009, Rainer Gerhards wrote: >> >>> First shot at it: >>> >>> 1. Make sure core dump is written (ulimit -c 999999999) >>> 2. Have it abort >>> 3. bdb /path/to/binary/tsyslogd >>> 4. Core name-of-corefile (usually /core.SOMENBR) >>> 5. Enter: bt (for backtrace) >>> 6. Enter: info thread (displays threads) >>> 7. For each thread: >>> 7a. Thread number >>> 7b. Bt >>> 8. You are done (ctl-d) >> >> Core was generated by `rsyslogd -c5 -x'. >> Program terminated with signal 11, Segmentation fault. >> [New process 11534] >> [New process 11538] >> [New process 11535] >> [New process 11537] >> [New process 11533] >> [New process 11536] >> #0 sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222 >> 222 if(pszMsg[iSrc] == '\0') { /* guard against \0 >> characters... */ >> (gdb) bt >> #0 sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222 >> #1 0x00000000004161f1 in parseMsg (pMsg=0x7f312c001530) at >> parser.c:260 >> #2 0x000000000040b6fc in msgConsumer (notNeeded=, >> pBatch=0xa2af98) at syslogd.c:942 >> #3 0x000000000042df9e in ConsumerReg (pThis=0xa30b00, pWti=0xa2af70) >> at queue.c:1818 >> #4 0x0000000000428220 in wtiWorker (pThis=0xa2af70) at wti.c:276 >> #5 0x00000000004279ac in wtpWorker (arg=0xa2af70) at wtp.c:349 >> #6 0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0 >> #7 0x00007f313de545ad in clone () from /lib/libc.so.6 >> #8 0x0000000000000000 in ?? 
() > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com > From rgerhards at hq.adiscon.com Wed Aug 26 11:36:30 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Wed, 26 Aug 2009 11:36:30 +0200 Subject: [rsyslog] abort in 4.2.1 References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com> Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD89@GRFEXC.intern.adiscon.com> > -----Original Message----- > From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- > bounces at lists.adiscon.com] On Behalf Of david at lang.hm > Sent: Wednesday, August 26, 2009 11:32 AM > To: rsyslog-users > Subject: Re: [rsyslog] abort in 4.2.1 > > On Wed, 26 Aug 2009, Rainer Gerhards wrote: > > > David, > > > > quick question: you originally quoted the bug with 4.2.1. I just > tried to > > find out the differences to this module and saw that no > official 4.2.1 > > exists. Did you mean 4.3.1 or an interim snapshot? If you no longer > know it, > > don't try hard to find out. I can probably do well enough with the > info I > > have, but this extra bit may be useful to verify I am looking at the > right > > pieces. > > I thought that it was 4.2.1, it was the v4-stable branch while you were > on > vacation. I first tried installing 5.x on the box, but it had the name > problem, so I went to the 4.x branch instead. I can check on another > machine in the morning, but since I didn't label it with the git commit > I don't know how accurate anything I will find will be. > That's good enough - it is extremely probable I am looking at the right set of changes (plus, I really need the changes just for the overall picture, the abort location itself - with the debug log - is sufficient). FYI: I am right now analyzing the crash, but it looks like I was a bit too optimistic this morning. 
I don't yet see anything that can cause the segfault. We probably need to go through some test iterations (I am trying to repro in my lab with some samples that I think may be useful based on what I saw). I think it would be a good idea to focus these tests on v4 (same code as v5), as it seems to be easier to reproduce with it. Is that OK with you? > I haven't ever had rsyslog running in the location that's crashing, so > there's no way of knowing how far back the bug has been there. If you would like to give it a try, it could possibly be that 4.2.0 does not have it. If it does, the bug is located somewhere else (which I still tend to think, but...). Rainer > > David Lang > > > Rainer > > > >> -----Original Message----- > >> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- > >> bounces at lists.adiscon.com] On Behalf Of david at lang.hm > >> Sent: Tuesday, August 25, 2009 11:57 PM > >> To: rsyslog-users > >> Subject: Re: [rsyslog] abort in 4.2.1 > >> > >> On Tue, 25 Aug 2009, Rainer Gerhards wrote: > >> > >>> First shot at it: > >>> > >>> 1. Make sure core dump is written (ulimit -c 999999999) > >>> 2. Have it abort > >>> 3. gdb /path/to/binary/rsyslogd > >>> 4. core name-of-corefile (usually /core.SOMENBR) > >>> 5. Enter: bt (for backtrace) > >>> 6. Enter: info thread (displays threads) > >>> 7. For each thread: > >>> 7a. thread number > >>> 7b. bt > >>> 8. You are done (ctl-d) > >> > >> Core was generated by `rsyslogd -c5 -x'. > >> Program terminated with signal 11, Segmentation fault. > >> [New process 11534] > >> [New process 11538] > >> [New process 11535] > >> [New process 11537] > >> [New process 11533] > >> [New process 11536] > >> #0 sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222 > >> 222 if(pszMsg[iSrc] == '\0') { /* guard against > \0 > >> characters... 
*/ > >> (gdb) bt > >> #0 sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222 > >> #1 0x00000000004161f1 in parseMsg (pMsg=0x7f312c001530) at > >> parser.c:260 > >> #2 0x000000000040b6fc in msgConsumer (notNeeded= out>, > >> pBatch=0xa2af98) at syslogd.c:942 > >> #3 0x000000000042df9e in ConsumerReg (pThis=0xa30b00, > pWti=0xa2af70) > >> at queue.c:1818 > >> #4 0x0000000000428220 in wtiWorker (pThis=0xa2af70) at wti.c:276 > >> #5 0x00000000004279ac in wtpWorker (arg=0xa2af70) at wtp.c:349 > >> #6 0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0 > >> #7 0x00007f313de545ad in clone () from /lib/libc.so.6 > >> #8 0x0000000000000000 in ?? () > > _______________________________________________ > > rsyslog mailing list > > http://lists.adiscon.net/mailman/listinfo/rsyslog > > http://www.rsyslog.com > > > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com From rgerhards at hq.adiscon.com Wed Aug 26 11:50:04 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Wed, 26 Aug 2009 11:50:04 +0200 Subject: [rsyslog] abort in 4.2.1 References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com> Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com> David, one more thing. If you still have the core file, could you start up gdb again and then do (gdb) thread 1 (gdb) print sanitizeMessage::pszMsg (gdb) print sanitizeMessage::szSanBuf (gdb) print sanitizeMessage::pMsg (gdb) print *sanitizeMessage::pMsg # note the asterisk! 
(gdb) print sanitizeMessage::iMaxLine (gdb) print sanitizeMessage::maxDest The following ones will likely yield no result as they are usually optimized out (moved into registers): (gdb) print sanitizeMessage::iSrc (gdb) print sanitizeMessage::iDst (gdb) print sanitizeMessage::pDst That will tell me if the pointers are ok, and what they actually point to. Based on the addresses I see, I guess that the message object pointer provided is already invalid. But it is hard to verify without the context... Rainer From rgerhards at hq.adiscon.com Wed Aug 26 12:20:09 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Wed, 26 Aug 2009 12:20:09 +0200 Subject: [rsyslog] abort in 4.2.1 References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com> Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD8C@GRFEXC.intern.adiscon.com> David, an update: it still would be good if you could obtain the info I asked for below, but it would also be useful to know (in addition) if the current v4-stable does experience the problem, too. v4-stable (and most probably the version you had) has code that is different in some key sections. So a test if it fails, too, actually tells more than I initially thought. Please also note that I have seen a potential bug inside the new sanitation code, but I think it is very, very unlikely to be causing the problem. I'll address this starting in v4-devel so a re-test of that version would also be useful once it is there. I will post when I am done with the fixing. Rainer > -----Original Message----- > From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- > bounces at lists.adiscon.com] On Behalf Of Rainer Gerhards > Sent: Wednesday, August 26, 2009 11:50 AM > To: rsyslog-users > Subject: Re: [rsyslog] abort in 4.2.1 > > David, > > one more thing. 
If you still have the core file, could you start up gdb > again > and then do > > (gdb) thread 1 > (gdb) print sanitizeMessage::pszMsg > (gdb) print sanitizeMessage::szSanBuf > (gdb) print sanitizeMessage::pMsg > (gdb) print *sanitizeMessage::pMsg # note the asterisk! > (gdb) print sanitizeMessage::iMaxLine > (gdb) print sanitizeMessage::maxDest > > The following ones will likely yield no result as they are usually > optimized out (moved into registers): > (gdb) print sanitizeMessage::iSrc > (gdb) print sanitizeMessage::iDst > (gdb) print sanitizeMessage::pDst > > That will tell me if the pointers are ok, and what they actually point > to. > Based on the addresses I see, I guess that the message object pointer > provided is already invalid. But it is hard to verify without the > context... > > Rainer > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com From rgerhards at hq.adiscon.com Wed Aug 26 13:00:41 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Wed, 26 Aug 2009 13:00:41 +0200 Subject: [rsyslog] abort in 4.2.1 References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD8C@GRFEXC.intern.adiscon.com> Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD8D@GRFEXC.intern.adiscon.com> I actually fixed three issues in the message sanitation code: http://git.adiscon.com/?p=rsyslog.git;a=commitdiff;h=aba8792c8a82ef52a3188ee7295e501ca21dae3b Note that this fix touches the abort location, but it will almost certainly not fix the segfault issue. I assume that the segfault now simply occurs in the " if(iscntrl((int) pszMsg[iSrc])) { ". I have not yet merged this code into the v5 branch. It is part of v4-beta and v4-devel; v4-stable and earlier versions did not have these issues. 
They resulted from regressions introduced during the optimizations I did. Rainer > -----Original Message----- > From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- > bounces at lists.adiscon.com] On Behalf Of Rainer Gerhards > Sent: Wednesday, August 26, 2009 12:20 PM > To: rsyslog-users > Subject: Re: [rsyslog] abort in 4.2.1 > > David, an update: > > it still would be good if you could obtain the info I asked for below, > but it > would also be useful to know (in addition) if the current v4-stable does > experience the problem, too. v4-stable (and most probably the version > you > had) has code that is different in some key sections. So a test if it > fails, > too, actually tells more than I initially thought. > > Please also note that I have seen a potential bug inside the new > sanitation > code, but I think it is very, very unlikely to be causing the problem. > I'll > address this starting in v4-devel so a re-test of that version would > also be > useful once it is there. I will post when I am done with the fixing. > > Rainer > > > -----Original Message----- > > From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- > > bounces at lists.adiscon.com] On Behalf Of Rainer Gerhards > > Sent: Wednesday, August 26, 2009 11:50 AM > > To: rsyslog-users > > Subject: Re: [rsyslog] abort in 4.2.1 > > > > David, > > > > one more thing. If you still have the core file, could you start up > gdb > > again > > and then do > > > > (gdb) thread 1 > > (gdb) print sanitizeMessage::pszMsg > > (gdb) print sanitizeMessage::szSanBuf > > (gdb) print sanitizeMessage::pMsg > > (gdb) print *sanitizeMessage::pMsg # note the asterisk! 
> > (gdb) print sanitizeMessage::iMaxLine > > (gdb) print sanitizeMessage::maxDest > > > > The following ones will likely yield no result as they are usually > > optimized out (moved into registers): > > (gdb) print sanitizeMessage::iSrc > > (gdb) print sanitizeMessage::iDst > > (gdb) print sanitizeMessage::pDst > > > > That will tell me if the pointers are ok, and what they actually > point > > to. > > Based on the addresses I see, I guess that the message object pointer > > provided is already invalid. But it is hard to verify without the > > context... > > > > Rainer > > _______________________________________________ > > rsyslog mailing list > > http://lists.adiscon.net/mailman/listinfo/rsyslog > > http://www.rsyslog.com > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com From david at lang.hm Wed Aug 26 13:04:00 2009 From: david at lang.hm (david at lang.hm) Date: Wed, 26 Aug 2009 04:04:00 -0700 (PDT) Subject: [rsyslog] abort in 4.2.1 In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD8C@GRFEXC.intern.adiscon.com> References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD8C@GRFEXC.intern.adiscon.com> Message-ID: I have a 10 hour meeting today, but I should be able to get the info you need from the core file. the problem is very rare compared to the number of logs received. when I was testing on Monday it took quite a while for 5.1.4+ to die, but when I duplicated things yesterday it was sometimes dying so fast that I couldn't do a start and then a ps and see it running (and since it dumped the core file in / instead of where I ran it from I did this multiple times) I'll test whatever commit you want me to. 
I would suggest that instead of directing me to a specific branch, just direct me to a commit so that there is no ambiguity over what is being tested. David Lang On Wed, 26 Aug 2009, Rainer Gerhards wrote: > Date: Wed, 26 Aug 2009 12:20:09 +0200 > From: Rainer Gerhards > Reply-To: rsyslog-users > To: rsyslog-users > Subject: Re: [rsyslog] abort in 4.2.1 > > David, an update: > > it still would be good if you could obtain the info I asked for below, but it > would also be useful to know (in addition) if the current v4-stable does > experience the problem, too. v4-stable (and most probably the version you > had) has code that is different in some key sections. So a test if it fails, > too, actually tells more than I initially thought. > > Please also note that I have seen a potential bug inside the new sanitation > code, but I think it is very, very unlikely to be causing the problem. I'll > address this starting in v4-devel so a re-test of that version would also be > useful once it is there. I will post when I am done with the fixing. > > Rainer > >> -----Original Message----- >> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- >> bounces at lists.adiscon.com] On Behalf Of Rainer Gerhards >> Sent: Wednesday, August 26, 2009 11:50 AM >> To: rsyslog-users >> Subject: Re: [rsyslog] abort in 4.2.1 >> >> David, >> >> one more thing. If you still have the core file, could you start up gdb >> again >> and then do >> >> (gdb) thread 1 >> (gdb) print sanitizeMessage::pszMsg >> (gdb) print sanitizeMessage::szSanBuf >> (gdb) print sanitizeMessage::pMsg >> (gdb) print *sanitizeMessage::pMsg # note the asterisk! 
>> (gdb) print sanitizeMessage::iMaxLine >> (gdb) print sanitizeMessage::maxDest >> >> The following ones likely will yield to no result as they are usually >> optimized out (moved into registers): >> (gdb) print sanitizeMessage::iSrc >> (gdb) print sanitizeMessage::iDst >> (gdb) print sanitizeMessage::pDst >> >> That will tell me if the pointers are ok, and what they actually point >> to. >> Based on the addresses I see, I guess that the message object pointer >> provided is already invalid. But it is hard to verify without the >> context... >> >> Rainer >> _______________________________________________ >> rsyslog mailing list >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> http://www.rsyslog.com > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com > From david at lang.hm Wed Aug 26 15:42:08 2009 From: david at lang.hm (david at lang.hm) Date: Wed, 26 Aug 2009 06:42:08 -0700 (PDT) Subject: [rsyslog] abort in 4.2.1 In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com> References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com> Message-ID: (gdb) thread 1 [Switching to thread 1 (process 11534)]#0 sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222 222 if(pszMsg[iSrc] == '\0') { /* guard against \0 characters... 
*/ (gdb) print sanitizeMessage::pszMsg $10 = (uchar *) 0x7f312c001658 "" (gdb) print sanitizeMessage::szSanBuf $11 = "?Z\224J\\002\\010\\031\\025*8\\006+?\\007?\204\\011\\002\\010\\031\\025*8\\006+?\\007?\204\\011g rcpt 'mrcravens at verizon.net'\\012f3b021455: to=, "??\001\000\000\000\000", __align = 3330471509296822577}, bDoLock = 0 '\0', bParseHOSTNAME = 1 '\001', iRefCount = 1, iSeverity = -1, iFacility = -1, offAfterPRI = 0, offMSG = -1, iProtocolVersion = 0, msgFlags = 48, iLenRawMsg = 0, iLenMSG = 0, iLenTAG = 0, iLenHOSTNAME = 0, pszRawMsg = 0x7f312c001658 "", pszHOSTNAME = 0x0, pszRcvdAt3164 = 0x0, pszRcvdAt3339 = 0x0, pszRcvdAt_MySQL = 0x0, pszRcvdAt_PgSQL = 0x0, pszTIMESTAMP3164 = 0x0, pszTIMESTAMP3339 = 0x0, pszTIMESTAMP_MySQL = 0x0, pszTIMESTAMP_PgSQL = 0x0, pCSProgName = 0x0, pCSStrucData = 0x0, pCSAPPNAME = 0x0, pCSPROCID = 0x0, pCSMSGID = 0x0, pInputName = 0xa313b0, pRcvFrom = 0x7f312c0012f0, pRcvFromIP = 0x0, pRuleset = 0x0, ttGenTime = 1251236576, tRcvdAt = { timeType = 2 '\002', month = 8 '\b', day = 25 '\031', hour = 21 '\025', minute = 42 '*', second = 56 '8', secfracPrecision = 6 '\006', OffsetMinute = 0 '\0', OffsetHour = 0 '\0', OffsetMode = 43 '+', year = 2009, secfrac = 647276}, tTIMESTAMP = {timeType = 2 '\002', month = 8 '\b', day = 25 '\031', hour = 21 '\025', minute = 42 '*', second = 56 '8', secfracPrecision = 6 '\006', OffsetMinute = 0 '\0', OffsetHour = 0 '\0', OffsetMode = 43 '+', year = 2009, secfrac = 647276}, szRawMsg = '\0' , "?Z\224J\000\000\000\000\002\b\031\025*8\006\000\000+?\a?\204\t\000\002\b\031\025*8\006\000\000+?\a?\204\t\000g rcpt 'mrcravens at verizon.net'\n\000f3b02", szHOSTNAME = "1455: to= (gdb) print sanitizeMessage::maxDest $15 = 0 (gdb) print sanitizeMessage::iSrc $16 = 129448 (gdb) print sanitizeMessage::iDst $17 = 821 (gdb) print sanitizeMessage::pDst $18 = ( uchar *) 0x40ba8e60 "?Z\224J\\002\\010\\031\\025*8\\006+?\\007?\204\\011\\002\\010\\031\\025*8\\006+?\\007?\204\\011g rcpt 'mrcravens at 
verizon.net'\\012f3b021455: to= Date: Wed, 26 Aug 2009 11:50:04 +0200 > From: Rainer Gerhards > Reply-To: rsyslog-users > To: rsyslog-users > Subject: Re: [rsyslog] abort in 4.2.1 > > David, > > one more thing. If you still have the core file, could you start up gdb again > and then do > > (gdb) thread 1 > (gdb) print sanitizeMessage::pszMsg > (gdb) print sanitizeMessage::szSanBuf > (gdb) print sanitizeMessage::pMsg > (gdb) print *sanitizeMessage::pMsg # note the asterisk! > (gdb) print sanitizeMessage::iMaxLine > (gdb) print sanitizeMessage::maxDest > > The following ones likely will yield to no result as they are usually > optimized out (moved into registers): > (gdb) print sanitizeMessage::iSrc > (gdb) print sanitizeMessage::iDst > (gdb) print sanitizeMessage::pDst > > That will tell me if the pointers are ok, and what they actually point to. > Based on the addresses I see, I guess that the message object pointer > provided is already invalid. But it is hard to verify without the context... 
> > Rainer > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com > From rgerhards at hq.adiscon.com Wed Aug 26 15:50:38 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Wed, 26 Aug 2009 15:50:38 +0200 Subject: [rsyslog] abort in 4.2.1 References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com> Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD97@GRFEXC.intern.adiscon.com> > -----Original Message----- > From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- > bounces at lists.adiscon.com] On Behalf Of david at lang.hm > Sent: Wednesday, August 26, 2009 3:42 PM > To: rsyslog-users > Subject: Re: [rsyslog] abort in 4.2.1 > > (gdb) thread 1 > [Switching to thread 1 (process 11534)]#0 sanitizeMessage > (pMsg=0x7f312c001530) at parser.c:222 > 222 if(pszMsg[iSrc] == '\0') { /* guard against \0 > characters... */ > > (gdb) print sanitizeMessage::pszMsg > $10 = (uchar *) 0x7f312c001658 "" > (gdb) print sanitizeMessage::szSanBuf > $11 = > "?Z\224J\\002\\010\\031\\025*8\\006+?\\007?\204\\011\\002\\010\\031\\02 On a quick look, this looks seriously malformed, so I think either the message object or the pointer to it (more likely) was corrupted some time before it was passed to the function that then malfunctioned. I will now look more in-depth, but it looks like we have one of those situations where the bug bites in a totally unrelated section of the code but causes a crash somewhere else. Would it be possible to run the instance under valgrind control? It will run 5 to 10 times slower, but if that would be fast enough, it could (could!) help to pinpoint the root cause. I can talk you through using the tool if you have not used it before (it's quite trivial). 
Rainer From david at lang.hm Wed Aug 26 15:59:37 2009 From: david at lang.hm (david at lang.hm) Date: Wed, 26 Aug 2009 06:59:37 -0700 (PDT) Subject: [rsyslog] abort in 4.2.1 In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FD97@GRFEXC.intern.adiscon.com> References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD97@GRFEXC.intern.adiscon.com> Message-ID: On Wed, 26 Aug 2009, Rainer Gerhards wrote: >> -----Original Message----- >> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- >> bounces at lists.adiscon.com] On Behalf Of david at lang.hm >> Sent: Wednesday, August 26, 2009 3:42 PM >> To: rsyslog-users >> Subject: Re: [rsyslog] abort in 4.2.1 >> >> (gdb) thread 1 >> [Switching to thread 1 (process 11534)]#0 sanitizeMessage >> (pMsg=0x7f312c001530) at parser.c:222 >> 222 if(pszMsg[iSrc] == '\0') { /* guard against \0 >> characters... */ >> >> (gdb) print sanitizeMessage::pszMsg >> $10 = (uchar *) 0x7f312c001658 "" >> (gdb) print sanitizeMessage::szSanBuf >> $11 = >> "?Z\224J\\002\\010\\031\\025*8\\006+?\\007?\204\\011\\002\\010\\031\\02 > > On quick look, this looks seriously malformed, so I think either the message > object or the pointer to it (more likely) was corrupted some time before it > was passed to the function that than malfunctioned. Will look now more > in-depth, but it looks like we need to have one of these situations where the > bug bites at a totally unrelated section of the code but causes a crash > somewhere else. > > Would it be possible to run the instance under valgrind control? It will run > 5 to 10 times slower, but if that would be fast enough, it could (could!) > help to pinpoint the root cause. I can talk you through using the tool if you > do not have used it before (its quite trivial). 
that would be hard to do for a couple reasons at 5-10 times slower the system may not be able to keep up (even with the 'slower' afternoon traffic) this is running on a very hardened production server, getting valgrind installed there would require permission from the SVP level. David Lang From rgerhards at hq.adiscon.com Wed Aug 26 16:01:08 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Wed, 26 Aug 2009 16:01:08 +0200 Subject: [rsyslog] abort in 4.2.1 References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD97@GRFEXC.intern.adiscon.com> Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FD9A@GRFEXC.intern.adiscon.com> > > that would be hard to do for a couple reasons > > at 5-10 times slower the system may not be able to keep up (even with > the > 'slower' afternoon traffic) > > this is running on a very hardened production server, getting valgrind > installed there would require permission from the SVP level. > understood. 
So let me see what else I can come up with :) Rainer From rgerhards at hq.adiscon.com Fri Aug 28 11:45:06 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Fri, 28 Aug 2009 11:45:06 +0200 Subject: [rsyslog] abort in 4.2.1 References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD97@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD9A@GRFEXC.intern.adiscon.com> Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FDAF@GRFEXC.intern.adiscon.com> > -----Original Message----- > From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- > bounces at lists.adiscon.com] On Behalf Of Rainer Gerhards > Sent: Wednesday, August 26, 2009 4:01 PM > To: rsyslog-users > Subject: Re: [rsyslog] abort in 4.2.1 > > > > > that would be hard to do for a couple reasons > > > > at 5-10 times slower the system may not be able to keep up (even with > > the > > 'slower' afternoon traffic) > > > > this is running on a very hardened production server, getting > valgrind > > installed there would require permission from the SVP level. > > > > understood. So let me see what else I can come up with :) I ran a lab test yesterday where I sent roughly 1.5 billion messages (based on what I saw in the debug logs). Unfortunately, no abort happened. However, my traffic pattern was continuous traffic of the same message. So I am now going to create some new tooling that permits me to mimic your traffic pattern much better. That will probably take until early next week. To make this really work, it would be really useful if you could send me some complete messages from your environment. I suggest forwarding them via private mail. I hope this is possible. Also, it would be good if you could build with --enable-rtinst --enable-debug and try out that version on your machine. 
I am a bit concerned about the speed of the resulting executable, it may be too slow. You do not need to run it in debug mode itself. These options (especially --enable-debug) will activate in-depth runtime checks (assert, will abort when something wrong happens) and my hope is that they will catch the bug closer to the root cause. If so, I would need the gdb abort info (actually enabling debug output would be an option some time later). Please let me know what would be OK with you. Thanks, Rainer From david at lang.hm Fri Aug 28 23:55:25 2009 From: david at lang.hm (david at lang.hm) Date: Fri, 28 Aug 2009 14:55:25 -0700 (PDT) Subject: [rsyslog] abort in 4.2.1 In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FDAF@GRFEXC.intern.adiscon.com> References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD97@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD9A@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FDAF@GRFEXC.intern.adiscon.com> Message-ID: On Fri, 28 Aug 2009, Rainer Gerhards wrote: >> -----Original Message----- >> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- >> bounces at lists.adiscon.com] On Behalf Of Rainer Gerhards >> >>> >>> that would be hard to do for a couple reasons >>> >>> at 5-10 times slower the system may not be able to keep up (even with >>> the >>> 'slower' afternoon traffic) >>> >>> this is running on a very hardened production server, getting >> valgrind >>> installed there would require permission from the SVP level. >>> >> >> understood. So let me see what else I can come up with :) > > I ran a lab test yesterday where I sent roughly 1.5 billion messages (based on > what I saw in the debug logs). Unfortunately, no abort happened. However, my > traffic pattern was continuous traffic of the same message. 
> > So I am now going to create some new tooling that permits me to mimic your > traffic pattern much better. That will probably take until early next > week. To make this really work, it would be really useful if you could send > me some complete messages from your environment. I suggest forwarding them > via private mail. I hope this is possible. > > Also, it would be good if you could build with --enable-rtinst --enable-debug and try > out that version on your machine. I am a bit concerned about the speed of the > resulting executable, it may be too slow. You do not need to run it in debug > mode itself. These options (especially --enable-debug) will activate in-depth > runtime checks (assert, will abort when something wrong happens) and my hope > is that they will catch the bug closer to the root cause. If so, I would need > the gdb abort info (actually enabling debug output would be an option some > time later). > > Please let me know what would be OK with you. I will give this a try. I was going to suggest that since we have the message getting corrupted it may make sense to make a temporary branch that has multiple message buffers and at various times through the message processing it makes a copy of the message to the buffer. when the system crashes I will be able to look at the core and see where the message is getting corrupted. I will see about doing a tcpdump at the time that I do this and send it to you (I'll need to check with management, but since we have a contract in place for other reasons I think we can do this) I can't do this late on a Friday, but I should be able to do this Monday afternoon. 
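The checkpoint idea described above (copying the message into spare buffers at several processing stages so the copies can be inspected in the core file) could look roughly like the following. This is only a sketch under assumptions: the names snapshot_msg, snap_ring, SNAP_SLOTS and SNAP_BYTES are hypothetical and do not exist in rsyslog, and the ring is not thread-safe, so a multi-threaded daemon would need one ring per worker thread or a lock.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SNAP_SLOTS 8    /* hypothetical: how many checkpoints are remembered */
#define SNAP_BYTES 256  /* hypothetical: bytes kept per checkpoint */

/* One checkpoint: the stage that took it plus a truncated copy of the message. */
struct msg_snapshot {
    const char   *stage;             /* string literal naming the processing stage */
    size_t        len;               /* message length as claimed by the caller */
    size_t        copied;            /* bytes actually copied into data[] */
    unsigned char data[SNAP_BYTES];
};

/* Globals rather than stack variables, so they are easy to find in a core
 * file, e.g. with "print snap_ring" in gdb. */
struct msg_snapshot snap_ring[SNAP_SLOTS];
unsigned snap_next;                  /* total snapshots taken so far */

/* Record a copy of msg[0..len) under the given stage name.
 * The oldest entry is overwritten once the ring is full. */
void snapshot_msg(const char *stage, const unsigned char *msg, size_t len)
{
    assert(stage != NULL);
    struct msg_snapshot *s = &snap_ring[snap_next % SNAP_SLOTS];
    s->stage  = stage;
    s->len    = len;
    s->copied = (msg == NULL) ? 0 : (len < SNAP_BYTES ? len : SNAP_BYTES);
    if (s->copied > 0)
        memcpy(s->data, msg, s->copied);
    snap_next++;
}
```

A caller would invoke snapshot_msg("received", buf, len) after reception, snapshot_msg("enqueued", buf, len) before queuing, and so on; comparing the entries after a crash shows between which two stages the content changed. It would not help if the corruption hits the message object itself rather than the text.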
David Lang From rgerhards at hq.adiscon.com Mon Aug 31 12:50:49 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Mon, 31 Aug 2009 12:50:49 +0200 Subject: [rsyslog] abort in 4.2.1 In-Reply-To: References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD97@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD9A@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FDAF@GRFEXC.intern.adiscon.com> Message-ID: <1251715849.4897.13.camel@rgf11> On Fri, 2009-08-28 at 14:55 -0700, david at lang.hm wrote: > On Fri, 28 Aug 2009, Rainer Gerhards wrote: > > Also, it would be good if you could --enable-rtinst --enable-debug and try > > out that version on your machine. I am a bit concerned about the speed of the > > resulting executable, it may be too slow. You do not need to run it in debug > > mode itself. These option (especially--enable-debug) will activate in-depth > > runtime checks (assert, will abort when something wrong happens) and my hope > > is that they will catch the bug closer to the root cause. If so, I would need > > the gdb abort info (actually enabling debug output would be an option some > > time later). > > > > Please let me know what would be OK with you. > > I will give this a try. > > I was going to suggest that since we have the message getting corrupted it > may make sense to make a temporary branch that has multiple message > buffers and at various times through the message processing it makes a > copy of the emssage to the buffer. when the system crashes I will be able > to look at the core and see where the message is getting corrupted. David, I fear it is even more complicated than that. It looks like not only the message got corrupted but the message object itself. 
There are already two copies of some of the message elements, and they also look inconsistent, unless we really had a null message, that is, one with no content at all (and generating a message object from a null message, I think, would be a bug in itself - but I am sure there are no such messages in your actual traffic). If you think there could be a real null message, I'd follow that path (will probably do so in any case...). I think that what really happens is that some part of the code runs wild, thus corrupting some random part of main memory. Sometimes it hits queue structures (or the message object that is held by them) and if so, we will see the abort you experience. With that scenario, duplicating the message buffer does not really help, because looking at the corrupted message object would not provide any additional information. However, if that's easy enough to reproduce, it would probably be good if you could send me the core analysis (the backtrace and the print statements) from a few (five maybe?) independent aborts. Maybe they show a pattern. It would probably be best to send them via private mail, as I am not sure if they disclose more than they should. > > I will see about doing a tcpdump at the time that I do this and send it to > you (I'll need to check with management, but since we have a contract in > place for other reasons I think we can do this) > That would probably be a good thing. I've made some progress with my testing tool, and I have created a basic version right now. Probably not good enough to mimic your traffic pattern, but closer. I have been running a test for quite some time now, unfortunately so far without abort. Note that I ran into trouble with UDP - even though I've put some one-ms sleeps into the code, I lose a lot of messages, apparently even before they hit the wire. It's always really troublesome to test with UDP... Rainer > I can't do this late on a friday, but I should be able to do this monday > afternoon.
> > David Lang > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com From rgerhards at hq.adiscon.com Mon Aug 31 15:19:18 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Mon, 31 Aug 2009 15:19:18 +0200 Subject: [rsyslog] abort in 4.2.1 References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD97@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD9A@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FDAF@GRFEXC.intern.adiscon.com> <1251715849.4897.13.camel@rgf11> Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FDC7@GRFEXC.intern.adiscon.com> David, quick question: do you have name resolution enabled on the system in question? I am asking because I just got a valgrind violation my lab (but not an abort yet) that points into the name resolution area. Rainer > -----Original Message----- > From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- > bounces at lists.adiscon.com] On Behalf Of Rainer Gerhards > Sent: Monday, August 31, 2009 12:51 PM > To: rsyslog-users > Subject: Re: [rsyslog] abort in 4.2.1 > > On Fri, 2009-08-28 at 14:55 -0700, david at lang.hm wrote: > > On Fri, 28 Aug 2009, Rainer Gerhards wrote: > > > Also, it would be good if you could --enable-rtinst --enable-debug > and try > > > out that version on your machine. I am a bit concerned about the > speed of the > > > resulting executable, it may be too slow. You do not need to run it > in debug > > > mode itself. These option (especially--enable-debug) will activate > in-depth > > > runtime checks (assert, will abort when something wrong happens) > and my hope > > > is that they will catch the bug closer to the root cause. 
If so, I > would need > > > the gdb abort info (actually enabling debug output would be an > option some > > > time later). > > > > > > Please let me know what would be OK with you. > > > > I will give this a try. > > > > I was going to suggest that since we have the message getting > corrupted it > > may make sense to make a temporary branch that has multiple message > > buffers and at various times through the message processing it makes > a > > copy of the emssage to the buffer. when the system crashes I will be > able > > to look at the core and see where the message is getting corrupted. > > David, I fear it is even more complicated than that. It looks like not > only the message got corrupted but the message object itself. There are > already two copies of some of the message elements, and they also look > inconsistent - except, if we really had a null message, that is one > with > no content at all (and generating a message object from a null message, > I think, would be a bug in itself - but I am sure there are no such > messages in your actual traffic). If you think there could be a real > null message, I'd follow that path (will probably do so in any > case...). > > I think that what really happens is that some part of the code runs > wild, thus invalidating some random part of the main memory. At some > times, it hits queue structures (or the message object that is held by > them) and if so, we will see the abort you experience. With that > scenario, duplicating the message buffer does not really help, because > looking at the corrupted message object would not provide any > additional > information. > > However, if that's easy enough to reproduce, it would probably be good > if you could send me the core analysis (the backtrace and the print > statements) from a few (five maybe?) independent aborts. Maybe they > show > a pattern. It would probably best to send them via private mail, as I > am > not sure if they disclose more than they should. 
> > > > > I will see about doing a tcpdump at the time that I do this and send > it to > > you (I'll need to check with management, but since we have a contract > in > > place for other reasons I think we can do this) > > > > That would probably be a good thing. I've made some progress with my > testing tool, and I have created a basic version right now. Probably > not > good enough to mimic your traffic pattern, but closer. I am doing a > test > run for quite some time now, unfortunately so far without abort. > > Note that I run into the trouble with UDP - even though I've put some > one-ms sleeps into the code, I lose a lot of messages, as it looks even > before they hit the wire. It's always real trobulesome to test with > UDP... > > Rainer > > I can't do this late on a friday, but I should be able to do this > monday > > afternoon. > > > > David Lang > > _______________________________________________ > > rsyslog mailing list > > http://lists.adiscon.net/mailman/listinfo/rsyslog > > http://www.rsyslog.com > > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com From dag at wieers.com Mon Aug 31 17:01:36 2009 From: dag at wieers.com (Dag Wieers) Date: Mon, 31 Aug 2009 17:01:36 +0200 (CEST) Subject: [rsyslog] Three bugs to stable v2 reported to Red Hat Message-ID: Hi, Last week we reported three bugs to Red Hat that hit us in a single incident :) Each of these three bugs were apparently known "shortcomings" in rsyslog v2.0.6, here is the list: - Bug 519192 - rsyslog server cannot handle more than 1000 open files https://bugzilla.redhat.com/show_bug.cgi?id=519192 - Bug 519201 - rsyslog stops logging locally if remote messages are being queued https://bugzilla.redhat.com/show_bug.cgi?id=519201 - Bug 519203 - Applications block when rsyslog has remote messages queued https://bugzilla.redhat.com/show_bug.cgi?id=519203 We hope we will get fixes for at least the two latter 
issues, as they seem to be fixed in newer releases. Since it is likely that Red Hat might rebase rsyslog in RHEL5 to a newer stable branch in RHEL5.5, we hope to see a fix before we see a rebased package. The limitation of 1000 open file descriptors however (limitation of select()) is still there in newer rsyslog releases and therefore we are probably forced to work around it. I personally find it strange, though, that this limitation is not a more widespread problem. Is everybody using a database backend? Or are people segregating syslog messages by location/importance? Feedback welcome! -- -- dag wieers, dag at wieers.com, http://dag.wieers.com/ -- [Any errors in spelling, tact or fact are transmission errors] From david at lang.hm Mon Aug 31 17:33:38 2009 From: david at lang.hm (david at lang.hm) Date: Mon, 31 Aug 2009 08:33:38 -0700 (PDT) Subject: [rsyslog] abort in 4.2.1 In-Reply-To: <9B6E2A8877C38245BFB15CC491A11DA706FDC7@GRFEXC.intern.adiscon.com> References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD97@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD9A@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FDAF@GRFEXC.intern.adiscon.com> <1251715849.4897.13.camel@rgf11> <9B6E2A8877C38245BFB15CC491A11DA706FDC7@GRFEXC.intern.adiscon.com> Message-ID: On Mon, 31 Aug 2009, Rainer Gerhards wrote: > quick question: do you have name resolution enabled on the system in > question? I am asking because I just got a valgrind violation in my lab (but not > an abort yet) that points into the name resolution area.
no, I run this with -x David Lang > Rainer > >> -----Original Message----- >> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog- >> bounces at lists.adiscon.com] On Behalf Of Rainer Gerhards >> Sent: Monday, August 31, 2009 12:51 PM >> To: rsyslog-users >> Subject: Re: [rsyslog] abort in 4.2.1 >> >> On Fri, 2009-08-28 at 14:55 -0700, david at lang.hm wrote: >>> On Fri, 28 Aug 2009, Rainer Gerhards wrote: >>>> Also, it would be good if you could --enable-rtinst --enable-debug >> and try >>>> out that version on your machine. I am a bit concerned about the >> speed of the >>>> resulting executable, it may be too slow. You do not need to run it >> in debug >>>> mode itself. These option (especially--enable-debug) will activate >> in-depth >>>> runtime checks (assert, will abort when something wrong happens) >> and my hope >>>> is that they will catch the bug closer to the root cause. If so, I >> would need >>>> the gdb abort info (actually enabling debug output would be an >> option some >>>> time later). >>>> >>>> Please let me know what would be OK with you. >>> >>> I will give this a try. >>> >>> I was going to suggest that since we have the message getting >> corrupted it >>> may make sense to make a temporary branch that has multiple message >>> buffers and at various times through the message processing it makes >> a >>> copy of the emssage to the buffer. when the system crashes I will be >> able >>> to look at the core and see where the message is getting corrupted. >> >> David, I fear it is even more complicated than that. It looks like not >> only the message got corrupted but the message object itself. There are >> already two copies of some of the message elements, and they also look >> inconsistent - except, if we really had a null message, that is one >> with >> no content at all (and generating a message object from a null message, >> I think, would be a bug in itself - but I am sure there are no such >> messages in your actual traffic). 
If you think there could be a real >> null message, I'd follow that path (will probably do so in any >> case...). >> >> I think that what really happens is that some part of the code runs >> wild, thus invalidating some random part of the main memory. At some >> times, it hits queue structures (or the message object that is held by >> them) and if so, we will see the abort you experience. With that >> scenario, duplicating the message buffer does not really help, because >> looking at the corrupted message object would not provide any >> additional >> information. >> >> However, if that's easy enough to reproduce, it would probably be good >> if you could send me the core analysis (the backtrace and the print >> statements) from a few (five maybe?) independent aborts. Maybe they >> show >> a pattern. It would probably best to send them via private mail, as I >> am >> not sure if they disclose more than they should. >> >>> >>> I will see about doing a tcpdump at the time that I do this and send >> it to >>> you (I'll need to check with management, but since we have a contract >> in >>> place for other reasons I think we can do this) >>> >> >> That would probably be a good thing. I've made some progress with my >> testing tool, and I have created a basic version right now. Probably >> not >> good enough to mimic your traffic pattern, but closer. I am doing a >> test >> run for quite some time now, unfortunately so far without abort. >> >> Note that I run into the trouble with UDP - even though I've put some >> one-ms sleeps into the code, I lose a lot of messages, as it looks even >> before they hit the wire. It's always real trobulesome to test with >> UDP... >> >> Rainer >>> I can't do this late on a friday, but I should be able to do this >> monday >>> afternoon. 
>>> >>> David Lang >>> _______________________________________________ >>> rsyslog mailing list >>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>> http://www.rsyslog.com >> >> _______________________________________________ >> rsyslog mailing list >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> http://www.rsyslog.com > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com > From david at lang.hm Mon Aug 31 17:37:31 2009 From: david at lang.hm (david at lang.hm) Date: Mon, 31 Aug 2009 08:37:31 -0700 (PDT) Subject: [rsyslog] abort in 4.2.1 In-Reply-To: <1251715849.4897.13.camel@rgf11> References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD97@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FD9A@GRFEXC.intern.adiscon.com> <9B6E2A8877C38245BFB15CC491A11DA706FDAF@GRFEXC.intern.adiscon.com> <1251715849.4897.13.camel@rgf11> Message-ID: On Mon, 31 Aug 2009, Rainer Gerhards wrote: > On Fri, 2009-08-28 at 14:55 -0700, david at lang.hm wrote: >> On Fri, 28 Aug 2009, Rainer Gerhards wrote: >>> Also, it would be good if you could --enable-rtinst --enable-debug and try >>> out that version on your machine. I am a bit concerned about the speed of the >>> resulting executable, it may be too slow. You do not need to run it in debug >>> mode itself. These option (especially--enable-debug) will activate in-depth >>> runtime checks (assert, will abort when something wrong happens) and my hope >>> is that they will catch the bug closer to the root cause. If so, I would need >>> the gdb abort info (actually enabling debug output would be an option some >>> time later). >>> >>> Please let me know what would be OK with you. >> >> I will give this a try. 
>> >> I was going to suggest that since we have the message getting corrupted it >> may make sense to make a temporary branch that has multiple message >> buffers and at various times through the message processing it makes a >> copy of the emssage to the buffer. when the system crashes I will be able >> to look at the core and see where the message is getting corrupted. > > David, I fear it is even more complicated than that. It looks like not > only the message got corrupted but the message object itself. There are > already two copies of some of the message elements, and they also look > inconsistent - except, if we really had a null message, that is one with > no content at all (and generating a message object from a null message, > I think, would be a bug in itself - but I am sure there are no such > messages in your actual traffic). If you think there could be a real > null message, I'd follow that path (will probably do so in any case...). I know that in some places on my network I am seeing malformed messages that look like they are overflowing one packet and so trying to go into a second packet (with the result being 20 or so characters being the entire contents of the message and showing up as the system name with no actual system tag or message folowing it) it's possible that there are packets with nothing in them, but I am not aware of them. > I think that what really happens is that some part of the code runs > wild, thus invalidating some random part of the main memory. At some > times, it hits queue structures (or the message object that is held by > them) and if so, we will see the abort you experience. With that > scenario, duplicating the message buffer does not really help, because > looking at the corrupted message object would not provide any additional > information. 
ouch > However, if that's easy enough to reproduce, it would probably be good > if you could send me the core analysis (the backtrace and the print > statements) from a few (five maybe?) independent aborts. Maybe they show > a pattern. It would probably be best to send them via private mail, as I am > not sure if they disclose more than they should. I will see about doing that. >> >> I will see about doing a tcpdump at the time that I do this and send it to >> you (I'll need to check with management, but since we have a contract in >> place for other reasons I think we can do this) >> > > That would probably be a good thing. I've made some progress with my > testing tool, and I have created a basic version right now. Probably not > good enough to mimic your traffic pattern, but closer. I am doing a test > run for quite some time now, unfortunately so far without abort. > > Note that I run into the trouble with UDP - even though I've put some > one-ms sleeps into the code, I lose a lot of messages, as it looks even > before they hit the wire. It's always real troublesome to test with > UDP... interesting. I have been able to get very high transmission rates with UDP without losing packets. what I did was to use syslog to generate sample messages, capture them with tcpdump, and then use tcpreplay to send them at varying data rates. David Lang > Rainer >> I can't do this late on a friday, but I should be able to do this monday >> afternoon. >> >> David Lang >> _______________________________________________ >> rsyslog mailing list >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> http://www.rsyslog.com > > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com > From konton at gmail.com Mon Aug 31 20:39:13 2009 From: konton at gmail.com (konton) Date: Mon, 31 Aug 2009 11:39:13 -0700 Subject: [rsyslog] What happens when the remote server is unavailable?
Message-ID: <6848943a0908311139j75cb5fabpd287a8778006607b@mail.gmail.com> I have 2 systems that I want to put rsyslog on and I'm investigating my options. One system will be the rsyslog master and will log everything to a database. The other will forward all its logs to the master syslog through an SSH tunnel. 1. What happens if I have a rule with an action @tunnel_ip:tunnel_port, but I haven't created the tunnel yet? Will rsyslog keep trying to connect to the remote server indefinitely? 2. Will rsyslog queue messages intended for the broken connection until the connection is available, and then send them all once the connection comes up? Thanks, Jason From rgerhards at hq.adiscon.com Mon Aug 31 21:31:11 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Mon, 31 Aug 2009 21:31:11 +0200 Subject: [rsyslog] What happens when the remote server is unavailable? References: <6848943a0908311139j75cb5fabpd287a8778006607b@mail.gmail.com> Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FDC8@GRFEXC.intern.adiscon.com> > -----Original Message----- > From: rsyslog-bounces at lists.adiscon.com > [mailto:rsyslog-bounces at lists.adiscon.com] On Behalf Of konton > Sent: Monday, August 31, 2009 8:39 PM > To: rsyslog at lists.adiscon.com > Subject: [rsyslog] What happens when the remote server is unavailable? > > I have 2 systems that I want to put rsyslog on and I'm investigating > my options. One system will be the rsyslog master and will log > everything to a database. The other will forward all its logs to > the master syslog through an SSH tunnel. > > 1. What happens if I have a rule with an action > @tunnel_ip:tunnel_port, but I haven't created the tunnel yet? Will > rsyslog keep trying to connect to the remote server indefinitely? Yes > > 2. Will rsyslog queue messages intended for the broken connection > until the connection is available, and then send them all once the > connection comes up? Depends on your configuration.
You can specify what you want it to do. HTH Rainer > > Thanks, > Jason > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com > From rgerhards at hq.adiscon.com Mon Aug 31 21:49:07 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Mon, 31 Aug 2009 21:49:07 +0200 Subject: [rsyslog] abort in 4.2.1 References: <000401ca25a1$49d004bd$100013ac@intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD87@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD8A@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD97@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FD9A@GRFEXC.intern.adiscon.com><9B6E2A8877C38245BFB15CC491A11DA706FDAF@GRFEXC.intern.adiscon.com><1251715849.4897.13.camel@rgf11> Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FDC9@GRFEXC.intern.adiscon.com> David, my tests (and your responses) made my thinking evolve a bit. I think that the problem is probably related to some corruption of stack variables (not the stack frame itself, but some pointers on it). One reason, among others, is that I don't see any real problems under valgrind, while I still get a violation in some code that looks totally unrelated. The best explanation I can think of is this type of stack-based error, which can *not* be detected by valgrind. Also, a lot of the optimizations I made involved moving (costly) heap memory allocations to (far cheaper) stack memory allocations. So this is another indication that the problem may be rooted in the stack area. There is one thing that would help verify this assumption: it would be great if you could run 4.2.0 (.0 is important!) on the system that experiences the problem. If that runs stable, the problem source is very probably in the optimizations I did. If it still crashes, well, then I need a new theory ;) Is this possible?
As a side note: I think that my UDP message loss may partly be related to DNS resolution. I will try this in a lab tomorrow. But I still think a lot of packets never leave the source system. This may be related to the virtual environment I am currently using for the lab. I hope to be able to generate the traffic with a program, because that offers me the flexibility (now and in the future) to test complex message scenarios (which, granted, does not help if it does not expose the problem...). Rainer > -----Original Message----- > From: rsyslog-bounces at lists.adiscon.com > [mailto:rsyslog-bounces at lists.adiscon.com] On Behalf Of david at lang.hm > Sent: Monday, August 31, 2009 5:38 PM > To: rsyslog-users > Subject: Re: [rsyslog] abort in 4.2.1 > > On Mon, 31 Aug 2009, Rainer Gerhards wrote: > > > On Fri, 2009-08-28 at 14:55 -0700, david at lang.hm wrote: > >> On Fri, 28 Aug 2009, Rainer Gerhards wrote: > >>> Also, it would be good if you could --enable-rtinst > --enable-debug and try > >>> out that version on your machine. I am a bit concerned > about the speed of the > >>> resulting executable, it may be too slow. You do not need > to run it in debug > >>> mode itself. These options (especially --enable-debug) will > activate in-depth > >>> runtime checks (assert, will abort when something wrong > happens) and my hope > >>> is that they will catch the bug closer to the root cause. > If so, I would need > >>> the gdb abort info (actually enabling debug output would > be an option some > >>> time later). > >>> > >>> Please let me know what would be OK with you. > >> > >> I will give this a try. > >> > >> I was going to suggest that since we have the message > getting corrupted it > >> may make sense to make a temporary branch that has multiple message > >> buffers and at various times through the message > processing it makes a > >> copy of the message to the buffer.
when the system crashes > I will be able > >> to look at the core and see where the message is getting corrupted. > > > > David, I fear it is even more complicated than that. It > looks like not > > only the message got corrupted but the message object > itself. There are > > already two copies of some of the message elements, and > they also look > > inconsistent - except, if we really had a null message, > that is one with > > no content at all (and generating a message object from a > null message, > > I think, would be a bug in itself - but I am sure there are no such > > messages in your actual traffic). If you think there could be a real > > null message, I'd follow that path (will probably do so in > any case...). > > I know that in some places on my network I am seeing > malformed messages > that look like they are overflowing one packet and so trying > to go into a > second packet (with the result being 20 or so characters > being the entire > contents of the message and showing up as the system name > with no actual > system tag or message folowing it) > > it's possible that there are packets with nothing in them, > but I am not > aware of them. > > > I think that what really happens is that some part of the code runs > > wild, thus invalidating some random part of the main memory. At some > > times, it hits queue structures (or the message object that > is held by > > them) and if so, we will see the abort you experience. With that > > scenario, duplicating the message buffer does not really > help, because > > looking at the corrupted message object would not provide > any additional > > information. > > ouch > > > However, if that's easy enough to reproduce, it would > probably be good > > if you could send me the core analysis (the backtrace and the print > > statements) from a few (five maybe?) independent aborts. > Maybe they show > > a pattern. 
It would probably best to send them via private > mail, as I am > > not sure if they disclose more than they should. > > I will see about doing that. > > >> > >> I will see about doing a tcpdump at the time that I do > this and send it to > >> you (I'll need to check with management, but since we have > a contract in > >> place for other reasons I think we can do this) > >> > > > > That would probably be a good thing. I've made some progress with my > > testing tool, and I have created a basic version right now. > Probably not > > good enough to mimic your traffic pattern, but closer. I am > doing a test > > run for quite some time now, unfortunately so far without abort. > > > > Note that I run into the trouble with UDP - even though > I've put some > > one-ms sleeps into the code, I lose a lot of messages, as > it looks even > > before they hit the wire. It's always real trobulesome to test with > > UDP... > > interesting. I have been able to get very high transmission > rates with UDP > without loosing packets. > > what I did was to use syslog to generate sample messages, > captured them > with tcpdump, and then used tcpreplay to send them at varying > data rates. > > David Lang > > > Rainer > >> I can't do this late on a friday, but I should be able to > do this monday > >> afternoon. 
> >> > >> David Lang > >> _______________________________________________ > >> rsyslog mailing list > >> http://lists.adiscon.net/mailman/listinfo/rsyslog > >> http://www.rsyslog.com > > > > _______________________________________________ > > rsyslog mailing list > > http://lists.adiscon.net/mailman/listinfo/rsyslog > > http://www.rsyslog.com > > > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com > From rgerhards at hq.adiscon.com Mon Aug 31 22:00:52 2009 From: rgerhards at hq.adiscon.com (Rainer Gerhards) Date: Mon, 31 Aug 2009 22:00:52 +0200 Subject: [rsyslog] Three bugs to stable v2 reported to Red Hat References: Message-ID: <9B6E2A8877C38245BFB15CC491A11DA706FDCA@GRFEXC.intern.adiscon.com> > The limitation of 1000 open file descriptors however (limitation of > select()) is still there in newer rsyslog releases and > therefore we are > probably forced to work around it. Although I find it > personally strange > that this limitation is not a more widespread problem. Is > everybody using > a database backend ? Or are people segregating syslog messages by > location/importance ? I am not sure that it is a select() limit. I routinely run tests with 2,000 TCP connections under Fedora and it works well. An issue, of course, is the per-process file handle limit, which (on many systems) is 1,024. In current releases, you can simply increase that limit via the $MaxOpenFiles directive: http://www.rsyslog.com/doc-rsconf1_maxopenfiles.html I have to admit I have not tested with more than 2,000 TCP connections. Rainer
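For readers who want to check the per-process file handle limit mentioned above on their own box, here is a minimal sketch in Python. It is not part of rsyslog: the function names and the 4096 target are illustrative assumptions, and the value that makes sense for $MaxOpenFiles depends on the system in question. It only reads the soft and hard limits of the current process and reports whether a given target lies within the hard limit.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def can_raise_to(target, limits=None):
    """Return True if the soft limit could be raised to `target`
    without exceeding the hard limit (i.e. without extra privileges).

    `limits` is an optional (soft, hard) pair; it defaults to the
    limits of the current process.
    """
    soft, hard = limits if limits is not None else nofile_limits()
    return hard == resource.RLIM_INFINITY or target <= hard


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft limit:", soft)
    print("hard limit:", hard)
    # 4096 is only an example target, not a recommendation.
    print("4096 within hard limit:", can_raise_to(4096))
```

If the soft limit reported here is 1,024 and the daemon needs more descriptors than that, the target has to fit under the hard limit, or the hard limit has to be raised by an administrator first.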