From davor.saric at srce.hr Fri Apr 4 14:36:52 2014
From: davor.saric at srce.hr (Davor Saric)
Date: Fri, 04 Apr 2014 14:36:52 +0200
Subject: [Lognorm] Log normalization and the leading space
Message-ID: <533EA764.2010901@srce.hr>

Hi,

I have a central rsyslog server and rsyslog clients that ship their logs to it. The clients run rsyslog v5 and the central server runs v7. The central server sends the clients' incoming logs to Elasticsearch and also ships its own local logs. On the clients I'm using the imfile module to read Apache logs, and I also use imfile on the central server to ship its own Apache logs to Elasticsearch.

The problem is that Apache logs coming from the clients have a leading space in the msg part, so the normalization rule for those logs is:

rule=: %client_ip:word% %rlogname:word% %ruser:word% [%apache_date:word% %tz:char-to:]%] "%method:word% %url:word% %pver:char-to:"%" %status:word% %bytesend:word% "%referrer:char-to:"%" "%useragent:char-to:"%"

And the normalization rule for the central server's own local Apache logs is:

rule=:%client_ip:word% %rlogname:word% %ruser:word% [%apache_date:word% %tz:char-to:]%] "%method:word% %url:word% %pver:char-to:"%" %status:word% %bytesend:word% "%referrer:char-to:"%" "%useragent:char-to:"%"

The only difference between the rules is that the one that normalizes incoming Apache logs from the clients starts with a space, and the one that normalizes the central server's local Apache logs does not.

Here are the template for incoming Apache logs and the template for local Apache logs. I had to use position.from=2 because of the space in the msg of incoming logs. If I use the same template for local Apache logs, the first character is cut off, which is the first digit of the client's IP address:

template(name="httpd-access_remote" type="list") {
    property(name="msg" position.from="2")
    constant(value="\n")
}

template(name="httpd-access_local" type="list") {
    property(name="msg")
    constant(value="\n")
}

As far as I can see, the msg property of incoming Apache logs has a space at the beginning, but when local logs are read through imfile the msg property has no leading space.

With regards,
--
Davor Saric, System Engineer
Computer Systems Department

SRCE - University of Zagreb University Computing Center, www.srce.unizg.hr
davor.saric at srce.hr, tel: +385 1 616 58 01, fax: +385 1 616 55 59

From rgerhards at hq.adiscon.com Fri Apr 4 14:43:30 2014
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Fri, 4 Apr 2014 14:43:30 +0200
Subject: [Lognorm] Log normalization and the leading space
In-Reply-To: <533EA764.2010901@srce.hr>
References: <533EA764.2010901@srce.hr>
Message-ID:

I am CC'ing the rsyslog mailing list, as the issue is more related to rsyslog and syslog in general. I suggest subscribing in order to receive follow-ups.

I think the problem you see is based on the fact that RFC3164, which is used to parse these types of messages, specifies that everything after the TAG is the message. Usually, messages have the form "TAG: mm"; note the space before mm. This is where it stems from.

In regard to lognorm rules, you can simply duplicate the entries with and without a space in front. It's a bit ugly, but a workaround you can use right now.

HTH
Rainer

From davor.saric at srce.hr Fri Apr 4 15:08:48 2014
From: davor.saric at srce.hr (Davor Saric)
Date: Fri, 04 Apr 2014 15:08:48 +0200
Subject: [Lognorm] Log normalization and the leading space
In-Reply-To:
References: <533EA764.2010901@srce.hr>
Message-ID: <533EAEE0.4090700@srce.hr>

On 04.04.2014 14:43, Rainer Gerhards wrote:
> I am CC'ing the rsyslog mailing list, as the issue is more related to rsyslog and syslog in general. I suggest subscribing in order to receive follow-ups.

Subscribed :)

> I think the problem you see is based on the fact that RFC3164, which is used to parse these types of messages, specifies that everything after the TAG is the message. Usually, messages have the form "TAG: mm"; note the space before mm. This is where it stems from.

"sensitive information replaced"

Ok, on a client with rsyslog v5, imfile writes to local5, and here is the line in local5.log:

Mar 25 13:28:10 hostname apache-access: 123.456.789.000 - - [25/Mar/2014:12:40:29 +0100]...

On the server with rsyslog v7, its own Apache logs read with imfile are written to local5, and the line is:

Apr 4 14:48:51 central apache-access: 111.222.333.444 - - [04/Apr/2014:14:48:50 +0200]...

You can see that the space is present in both logs. But when writing rules and templates, the central rsyslog somehow registers a space in the msg property of these incoming logs, but not in the msg property of local logs, which are fetched with imfile...
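
[Editor's note] A minimal Python sketch of why one template cannot serve both sources, assuming the simplified model Rainer describes in this thread (everything after "TAG:" is the message, so the space ends up in msg). The function names are hypothetical and this is not rsyslog's parser; it only contrasts an unconditional cut, comparable in effect to position.from="2", with a conditional one.

```python
def split_tag_and_msg(text):
    """Simplified RFC 3164 style split: everything after 'TAG:' is the message."""
    tag, _, msg = text.partition(":")
    return tag, msg  # msg keeps whatever follows the colon, including a space


def cut_first_char(msg):
    """Unconditional cut, comparable in effect to position.from="2"."""
    return msg[1:]


def strip_one_leading_space(msg):
    """Conditional cut: drop a single leading space only if it is there."""
    return msg[1:] if msg.startswith(" ") else msg


# Message as received from a client (the space after the tag stays in msg).
_, remote_msg = split_tag_and_msg("apache-access: 123.456.789.000 - -")
# Message as Davor reports it for local imfile input (no leading space).
local_msg = "111.222.333.444 - -"

print(repr(cut_first_char(remote_msg)))           # '123.456.789.000 - -'
print(repr(cut_first_char(local_msg)))            # '11.222.333.444 - -'
print(repr(strip_one_leading_space(remote_msg)))  # '123.456.789.000 - -'
print(repr(strip_one_leading_space(local_msg)))   # '111.222.333.444 - -'
```

The second line of output shows the symptom reported above: the unconditional cut removes the first digit of the IP address when there is no leading space to remove.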
Btw, the clients are CentOS 6 and Debian 7 with rsyslog v5, and the central rsyslog is CentOS 6 with rsyslog v7 stable...

> In regard to lognorm rules, you can simply duplicate the entries with and without a space in front. It's a bit ugly, but a workaround you can use right now.

If this is normal and not a bug: I already have two rules and templates, one for incoming logs and one for the central server's local Apache logs, so I have a workaround :)

With regards,
Davor Saric

From rgerhards at hq.adiscon.com Fri Apr 4 15:21:40 2014
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Fri, 4 Apr 2014 15:21:40 +0200
Subject: [Lognorm] [rsyslog] Log normalization and the leading space
In-Reply-To: <533EAEE0.4090700@srce.hr>
References: <533EA764.2010901@srce.hr> <533EAEE0.4090700@srce.hr>
Message-ID:

On Fri, Apr 4, 2014 at 3:08 PM, Davor Saric wrote:

> You can see that the space is present in both logs. But when writing rules and templates, the central rsyslog somehow registers a space in the msg property of these incoming logs, but not in the msg property of local logs, which are fetched with imfile...

That's because the omfile default template inserts the space if it is not there.

> If this is normal and not a bug: I already have two rules and templates, one for incoming logs and one for the central server's local Apache logs, so I have a workaround :)

No, it's not a bug, but it's still ugly. I always wanted to add an option to specify a template to be used for mmnormalize (where you could fix these things), but it does not play well with the message modification module interface, and so this change actually would be a couple of magnitudes larger than you'd usually expect. Given the mile-long todo list, this hasn't happened yet and probably will not in the foreseeable future :-(

Rainer

From davor.saric at srce.hr Fri Apr 4 15:40:29 2014
From: davor.saric at srce.hr (Davor Saric)
Date: Fri, 04 Apr 2014 15:40:29 +0200
Subject: [Lognorm] [rsyslog] Log normalization and the leading space
In-Reply-To:
References: <533EA764.2010901@srce.hr> <533EAEE0.4090700@srce.hr>
Message-ID: <533EB64D.2070707@srce.hr>

On 04.04.2014 15:21, Rainer Gerhards wrote:
> That's because the omfile default template inserts the space if it is not there.

Aha, ok.

> No, it's not a bug, but it's still ugly. I always wanted to add an option to specify a template to be used for mmnormalize (where you could fix these things), but it does not play well with the message modification module interface, and so this change actually would be a couple of magnitudes larger than you'd usually expect. Given the mile-long todo list, this hasn't happened yet and probably will not in the foreseeable future :-(

Got it. The important part is that when the clients' rsyslog is officially upgraded to v7, I won't have to change the rules and templates in a running production environment :)

Thanks for the fast response.

With regards,
Davor Saric

From friedl at adiscon.com Fri Apr 11 16:25:26 2014
From: friedl at adiscon.com (Florian Riedl)
Date: Fri, 11 Apr 2014 16:25:26 +0200
Subject: [Lognorm] liblognorm 1.0.1 released
Message-ID:

Hi all,

We have just released liblognorm 1.0.1. This is a pure maintenance release.

Changes

Version 1.0.1, 2014-04-11
- improved doc (via RST/Sphinx)
- bugfix: unparsed fields were copied incorrectly from non-terminated string. Thanks to Josh Blum for the fix.
- bugfix: mandatory tag did not work in lognormalizer

Download:
http://www.liblognorm.com/download/liblognorm-1-0-1/

As always, feedback is appreciated.

Best regards,
Florian Riedl

From secsubs at gmail.com Fri Apr 25 22:50:00 2014
From: secsubs at gmail.com (Xuri Nagarin)
Date: Fri, 25 Apr 2014 13:50:00 -0700
Subject: [Lognorm] regex engine for lognorm
Message-ID:

Hi,

I have been looking for a log normalization engine for logs captured via Syslog.
The output is to be fed into apps/platforms that like key/value pairs: Hadoop, Lucene-based search tools and some kind of stream processor (Storm/Spark).

An easy way to feed rules to this engine would be in this format:

[descriptor/label]
REGEX=someRegexThatExtractsMultipleGroups(1....n)
output=$1:time, $2:host, $3:tag, $4:group4, $5:group5 ........ $n:groupn

You should be able to specify multiple regex rules in this format that would get evaluated one after the other. Preferably, the engine would internally rank the regex list in order from most used to least used.

Is this something the liblognorm project can take on? Combined with the capabilities of rsyslog, this would be an enormously scalable and powerful tool for log analysis, because it would allow people to maintain a single data dictionary across multiple analysis engines like Hadoop, search and others.

Similar functionality exists in Flume/Logstash via grok, but Java simply sucks when it comes to regex parsing.

Thanks,
- Xuri

From david at lang.hm Sat Apr 26 09:18:54 2014
From: david at lang.hm (David Lang)
Date: Sat, 26 Apr 2014 00:18:54 -0700 (PDT)
Subject: [Lognorm] regex engine for lognorm
In-Reply-To:
References:
Message-ID:

On Fri, 25 Apr 2014, Xuri Nagarin wrote:

> Is this something the liblognorm project can take on? Combined with the capabilities of rsyslog, this would be an enormously scalable and powerful tool for log analysis, because it would allow people to maintain a single data dictionary across multiple analysis engines like Hadoop, search and others.
>
> Similar functionality exists in Flume/Logstash via grok, but Java simply sucks when it comes to regex parsing.

What liblognorm provides is actually more powerful. It takes the patterns that you provide (which are similar to regex, but not regex) and then compiles them into a parse tree for evaluation. This means that the log only needs to be evaluated once, not once for every rule, and the rule that most closely matches the log will generate the resulting parsed variables.

Before you throw this away for a regex engine, I would suggest that you investigate whether it can be used to match your logs as is. If it can, it will be FAR faster than any regex engine.

David Lang

From secsubs at gmail.com Mon Apr 28 21:14:14 2014
From: secsubs at gmail.com (Xuri Nagarin)
Date: Mon, 28 Apr 2014 12:14:14 -0700
Subject: [Lognorm] regex engine for lognorm
In-Reply-To:
References:
Message-ID:

Hi David,

I am open to any idea that does the job better :)

But I am not sure whether my use case is clear, or how liblognorm does a better job. Say I have 100 log types (firewall, ssh, proxy, custom apps, etc.) in a stream; then each log type needs a unique regex to normalize the log line. Right?

Now, I have a firehose of logs and I want to be able to feed a stream off the firehose to an instance of the normalization tier. Each instance of the normalization tier has all the rules to process the entire log-type set. This way, as traffic increases, I can scale horizontally almost infinitely by spinning up a new normalization tier instance.

So let's say we have log types "L", L1 to L100, and for each log type we have a unique regex to extract the fields and normalize the line, R1-R100. And a normalization tier instance is N1. Now, when L50 hits N1, how does liblognorm avoid going through R1-R100 (actually, it would stop at R50) to find the right match?

Thanks,

Xuri

From david at lang.hm Mon Apr 28 22:13:47 2014
From: david at lang.hm (David Lang)
Date: Mon, 28 Apr 2014 13:13:47 -0700 (PDT)
Subject: [Lognorm] regex engine for lognorm
In-Reply-To:
References:
Message-ID:

On Mon, 28 Apr 2014, Xuri Nagarin wrote:

> So let's say we have log types "L", L1 to L100, and for each log type we have a unique regex to extract the fields and normalize the line, R1-R100. And a normalization tier instance is N1. Now, when L50 hits N1, how does liblognorm avoid going through R1-R100 (actually, it would stop at R50) to find the right match?

liblognorm has a completely different way of operating than you are envisioning.

It compiles all the rules into a parse tree, walks that parse tree _once_, and has the log identified.

As a trivial example, if you have the following 'rules':

1. approximately
2. apart
3. apple

liblognorm would create a tree

ap - art
  \- p - roximately
       \ - le

So when it tries to match apple, it's not three full comparisons. It starts at the beginning and sees 'ap', then sees that the next character is a 'p' and goes down that branch, then sees that the next character is an 'l', goes down that branch, matches rule 3 and says "this is rule 3".

With this sort of matching, the number of rules has virtually no impact on the parsing speed; it's just the length of what you are matching.

The liblognorm config language is not as powerful as a full regex, but evaluating a full regex is _very_ expensive. The simplifications that liblognorm makes eliminate the most expensive things in a regex to evaluate, but should still be enough to cover your logs. If you find something you can't match, speak up and someone can either help you or liblognorm can be enhanced.

David Lang

From secsubs at gmail.com Wed Apr 30 00:56:23 2014
From: secsubs at gmail.com (Xuri Nagarin)
Date: Tue, 29 Apr 2014 15:56:23 -0700
Subject: [Lognorm] regex engine for lognorm
In-Reply-To:
References:
Message-ID:

On Mon, Apr 28, 2014 at 1:13 PM, David Lang wrote:
> With this sort of matching, the number of rules has virtually no impact on the parsing speed; it's just the length of what you are matching.

I think I understand the disconnect, or my lack of understanding, here. Re-reading the liblognorm documentation, I see that you are implementing a subset of the regex language by defining tokens such as "word", "number" or "ipv4". These are pre-packaged regex expressions. Supporting a basic set of regex allows you to avoid creating a full-blown regex engine and lets you implement a faster parsing mechanism like parse trees. But I am wondering whether this simplification comes at a cost in flexibility?

Take for example this log line that I want to break up into key/value pairs:

2014-04-29T21:24:42+00:00 hostnameA.abc.com Oracle Audit[31611]: LENGTH : '172' ACTION :[021] 'select * from products' DATABASE USER:[3] 'sys' PRIVILEGE :[6] 'SYSDBA' CLIENT USER:[6] 'oracle' CLIENT TERMINAL:[0] '' STATUS:[1] '0' DBID:[10] '2796591309'

/ACTION :[021] 'select * from products'/ needs to get normalized to "action=/'select * from products'/"

The action or SQL text can be of varying length and have a varying amount of whitespace between two keywords. ' select * from products' is just as valid as 'select * from products'. If I am doing regex, I can use /.+/ followed by the string that is expected to follow the "action" value. In this case, I would use "DATABASE USER:" as the boundary where "action" ends. Of course, this is easily doable in regex, but I am not sure how the liblognorm rule language handles it.
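
[Editor's note] A minimal Python sketch of the two ideas discussed here, under stated assumptions. It is not liblognorm code. `build_tree` and `match` (hypothetical names) model the prefix-tree walk David Lang describes earlier in this thread for the literal rules "approximately", "apart" and "apple": the message is walked once, and the cost depends on its length, not on the number of rules. `char_to` (also hypothetical) models a "read everything up to a stop character" field in the spirit of the char-to fields used in the Apache rules at the top of this thread; it assumes the SQL text itself contains no single quote.

```python
END = object()  # marker key: a rule ends at this node


def build_tree(rules):
    """Compile {rule_id: literal_text} into a nested-dict prefix tree."""
    root = {}
    for rule_id, text in rules.items():
        node = root
        for ch in text:
            node = node.setdefault(ch, {})
        node[END] = rule_id
    return root


def match(tree, message):
    """Walk the tree once, one step per character; return a rule id or None."""
    node = tree
    for ch in message:
        node = node.get(ch)
        if node is None:
            return None  # no rule shares this prefix
    return node.get(END)


def char_to(text, start, stop):
    """Return (field, position of stop): the text from start up to stop."""
    end = text.find(stop, start)
    if end == -1:
        raise ValueError("stop character not found")
    return text[start:end], end


tree = build_tree({1: "approximately", 2: "apart", 3: "apple"})
print(match(tree, "apple"))    # 3
print(match(tree, "apricot"))  # None

line = "ACTION :[021] 'select  *  from products' DATABASE USER:[3] 'sys'"
action, _ = char_to(line, line.index("'") + 1, "'")
print(action)  # select  *  from products
```

Because the field simply runs to the closing quote, the amount of whitespace inside the SQL text does not matter in this sketch.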

> The liblognorm config language is not as powerful as a full regex, but evaluating a full regex is _very_ expensive. The simplifications that liblognorm makes eliminate the most expensive things in a regex to evaluate, but should still be enough to cover your logs. If you find something you can't match, speak up and someone can either help you or liblognorm can be enhanced.

Good point. Full regex is expensive. But I'd argue that with the right design, namely the ability to partition a log stream equally into smaller streams (something rsyslog does not do today), I can throw hardware at the problem to scale horizontally, almost infinitely. Hardware is cheap these days! Also, regex is a more universal language, so it is easier to deploy and maintain from a user's point of view.

Thanks,
- Xuri

From pavel at levshin.spb.ru Wed Apr 30 07:45:35 2014
From: pavel at levshin.spb.ru (Pavel Levshin)
Date: Wed, 30 Apr 2014 09:45:35 +0400
Subject: [Lognorm] regex engine for lognorm
In-Reply-To:
References:
Message-ID: <53608DFF.7080702@levshin.spb.ru>

Hello.

Liblognorm is not a regex engine, nor does it try to be. For the sake of flexibility, it actually implements some sort of backtracking while searching the tree. It is needed to match variable fields to parsers. But it performs best when backtracking is not used.

This way of doing things is not as flexible as regex, but it is orders of magnitude faster than conventional regex. Believe me, I've tested it.

In your particular example, you can match the select clause with the "char-to" parser.

--
Pavel Levshin

30.04.2014 2:56, Xuri Nagarin:
> On Mon, Apr 28, 2014 at 1:13 PM, David Lang wrote:
>> liblognorm has a completely different way of operating than you are envisioning.
>>
>> It compiles all the rules into a parse tree, walks that parse tree _once_, and has the log identified.
>>
>> As a trivial example, if you have the following 'rules':
>>
>> 1. approximately
>> 2. apart
>> 3. apple
>>
>> liblognorm would create a tree
>>
>> ap - art
>>   \- p - roximately
>>        \ - le
>>
>> So when it tries to match apple, it's not three full comparisons. It starts at the beginning and sees 'ap', then sees that the next character is a 'p' and goes down that branch, then sees that the next character is an 'l', goes down that branch, matches rule 3 and says "this is rule 3".
>>
>> With this sort of matching, the number of rules has virtually no impact on the parsing speed; it's just the length of what you are matching.
> I think I understand the disconnect, or my lack of understanding, here. Re-reading the liblognorm documentation, I see that you are implementing a subset of the regex language by defining tokens such as "word", "number" or "ipv4". These are pre-packaged regex expressions. Supporting a basic set of regex allows you to avoid creating a full-blown regex engine and lets you implement a faster parsing mechanism like parse trees. But I am wondering whether this simplification comes at a cost in flexibility?
>
> Take for example this log line that I want to break up into key/value pairs:
>
> 2014-04-29T21:24:42+00:00 hostnameA.abc.com Oracle Audit[31611]: LENGTH : '172' ACTION :[021] 'select * from products' DATABASE USER:[3] 'sys' PRIVILEGE :[6] 'SYSDBA' CLIENT USER:[6] 'oracle' CLIENT TERMINAL:[0] '' STATUS:[1] '0' DBID:[10] '2796591309'
>
> /ACTION :[021] 'select * from products'/ needs to get normalized to "action=/'select * from products'/"
>
> The action or sql text can be of varying length and have varying number of whitespaces between two keywords. ' select * from products' is just as valid as 'select * from products'. If I am doing regex, I can use /.+/ followed by the string that is expected to succeed the "action" value. In this case, use "DATABASE USER:" as a boundary where "action" ends. Of course, this is easily doable in regex but not sure of how liblognorm rule language handles it.
>
>