From davor.saric at srce.hr  Fri Apr  4 14:36:52 2014
From: davor.saric at srce.hr (Davor Saric)
Date: Fri, 04 Apr 2014 14:36:52 +0200
Subject: [Lognorm] Log normalization and the leading space
Message-ID: <533EA764.2010901@srce.hr>

Hi,

I have a central rsyslog server and rsyslog clients that ship their
logs to it. The clients run rsyslog v5 and the central server runs v7.
The central server sends the clients' incoming logs to Elasticsearch
and also ships its own local logs. On the clients I'm using the imfile
module to read Apache logs, and I also use imfile on the central server
to ship its own Apache logs to Elasticsearch. The problem is that the
Apache logs coming from the clients have a leading space in the msg
part, so the normalization rule for those logs is:
rule=: %client_ip:word% %rlogname:word% %ruser:word% [%apache_date:word% 
%tz:char-to:]%] "%method:word% %url:word% %pver:char-to:"%" 
%status:word% %bytesend:word% "%referrer:char-to:"%" "%useragent:char-to:"%"

And the normalization rule for the central server's own local Apache logs is:
rule=:%client_ip:word% %rlogname:word% %ruser:word% [%apache_date:word% 
%tz:char-to:]%] "%method:word% %url:word% %pver:char-to:"%" 
%status:word% %bytesend:word% "%referrer:char-to:"%" "%useragent:char-to:"%"

The only difference between the rules is that the one that normalizes
incoming Apache logs from the clients starts with a space, and the one
that normalizes the central server's local Apache logs does not.

Here are the template for incoming Apache logs and the template for local
Apache logs. I had to use position.from=2 because of the space in the msg
of incoming logs. If I use the same template for local Apache logs, the
first character, which is the first digit of the client's IP address, is
cut off:

template(name="httpd-access_remote" type="list") {
property(name="msg" position.from="2")
constant(value="\n")
}

template(name="httpd-access_local" type="list") {
property(name="msg")
constant(value="\n")
}

As far as I can see, the msg property of incoming Apache logs has a space
at the beginning, but when local logs are read through imfile the msg
property has no space at the beginning.

With regards,
-- 
Davor Saric, System Engineer
Computer Systems Department

SRCE - University of Zagreb University Computing Center, www.srce.unizg.hr
davor.saric at srce.hr, tel: +385 1 616 58 01, fax: +385 1 616 55 59

From rgerhards at hq.adiscon.com  Fri Apr  4 14:43:30 2014
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Fri, 4 Apr 2014 14:43:30 +0200
Subject: [Lognorm] Log normalization and the leading space
In-Reply-To: <533EA764.2010901@srce.hr>
References: <533EA764.2010901@srce.hr>
Message-ID: <CADk+mPD55YWdySjKn3TbXaTDruB26aV33r878Gbb5-TV-jm=DQ@mail.gmail.com>

I am CC'ing the rsyslog mailing list as the issue is more related to
rsyslog and syslog in general. I suggest subscribing in order to receive
follow-ups.

I think the problem you see is based on the fact that RFC3164 - which is
used to parse these types of messages - specifies that everything after the
TAG is the message. Usually, messages have "TAG: mm", note the space before
mm. This is where it stems from.

In regard to lognorm rules, you can simply duplicate the entries with and
without a space in front. It's a bit ugly, but a work-around you can use
right now.
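
To make the origin of the space concrete, here is a small sketch in Python.
It is an illustration only, not rsyslog code: the helper names
split_tag_and_msg and strip_one_leading_space are made up for this example,
and the parsing is a simplified reading of the "everything after the TAG is
the message" behaviour described above.

```python
def split_tag_and_msg(content):
    """Split the part of a syslog line after the hostname into (tag, msg).

    Simplified assumption: the tag ends at the first colon, and everything
    after that colon, including a leading space, belongs to the message.
    """
    tag, sep, msg = content.partition(":")
    if not sep:
        # No colon at all: treat the whole content as the message.
        return "", content
    return tag + sep, msg


def strip_one_leading_space(msg):
    """Drop a single leading space, if present, and nothing else."""
    return msg[1:] if msg.startswith(" ") else msg


# A line as it arrives from a client: msg keeps the space after "TAG:".
tag, remote_msg = split_tag_and_msg("apache-access: 123.456.789.000 - -")

# A line as imfile hands it over locally: no leading space in msg.
local_msg = "111.222.333.444 - -"

# After stripping at most one space, both variants start with the IP.
remote_clean = strip_one_leading_space(remote_msg)
local_clean = strip_one_leading_space(local_msg)
```

Stripping conditionally avoids the problem with a fixed position.from="2",
which also cuts the first character of messages that have no leading space.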

HTH
Rainer



From davor.saric at srce.hr  Fri Apr  4 15:08:48 2014
From: davor.saric at srce.hr (Davor Saric)
Date: Fri, 04 Apr 2014 15:08:48 +0200
Subject: [Lognorm] Log normalization and the leading space
In-Reply-To: <CADk+mPD55YWdySjKn3TbXaTDruB26aV33r878Gbb5-TV-jm=DQ@mail.gmail.com>
References: <533EA764.2010901@srce.hr>
	<CADk+mPD55YWdySjKn3TbXaTDruB26aV33r878Gbb5-TV-jm=DQ@mail.gmail.com>
Message-ID: <533EAEE0.4090700@srce.hr>

On 04.04.2014 14:43, Rainer Gerhards wrote:
> I am CC'ing the rsyslog mailing list as the issue is more related the
> rsyslog and syslog in general. I suggest to subscribe in order to
> receive follow-ups.

Subscribed :)

> I think the problem you see is based on the fact that RFC3164 - which is
> used to parse these types of messages - specifies that everything after
> the TAG is the message. Usually, messages have "TAG: mm", note the space
> before mm. This is where it stems from.

"sensitive information replaced"

OK, on a client with rsyslog v5, imfile writes to local5, and here is the
line in local5.log:

Mar 25 13:28:10 hostname apache-access: 123.456.789.000 - - 
[25/Mar/2014:12:40:29 +0100]...

On the server with rsyslog v7, its own Apache logs read via imfile are
written to local5, and the line is:
Apr  4 14:48:51 central apache-access: 111.222.333.444 - - 
[04/Apr/2014:14:48:50 +0200]...

You can see that the space is present in both logs. But when writing rules
and templates, the central rsyslog somehow registers a space in the msg
property of these incoming logs, but not in the msg property of the local
logs, which are fetched with imfile...

Btw, the clients are CentOS 6 and Debian 7 with rsyslog v5, and the central
rsyslog is CentOS 6 with rsyslog v7 stable...

> In regard to lognorm rules, you can simply duplicate the entries with
> and without a space in front. It's a bit ugly, but a work-around you can
> use right now.

If this is normal and not a bug, I already have two rules and templates,
one for incoming logs and one for the central server's local Apache logs,
so I have a workaround :)

With regards,
Davor Saric

From rgerhards at hq.adiscon.com  Fri Apr  4 15:21:40 2014
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Fri, 4 Apr 2014 15:21:40 +0200
Subject: [Lognorm] [rsyslog]  Log normalization and the leading space
In-Reply-To: <533EAEE0.4090700@srce.hr>
References: <533EA764.2010901@srce.hr>
	<CADk+mPD55YWdySjKn3TbXaTDruB26aV33r878Gbb5-TV-jm=DQ@mail.gmail.com>
	<533EAEE0.4090700@srce.hr>
Message-ID: <CADk+mPA7oW_6CaMp2xspNtRYcuDQyeuDCVsPeCEWqdJ8XUO6aA@mail.gmail.com>

On Fri, Apr 4, 2014 at 3:08 PM, Davor Saric <davor.saric at srce.hr> wrote:

> On 04.04.2014 14:43, Rainer Gerhards wrote:
>
>> I am CC'ing the rsyslog mailing list as the issue is more related the
>> rsyslog and syslog in general. I suggest to subscribe in order to
>> receive follow-ups.
>>
>
> Subscribed :)
>
>
>> I think the problem you see is based on the fact that RFC3164 - which is
>> used to parse these types of messages - specifies that everything after
>> the TAG is the message. Usually, messages have "TAG: mm", note the space
>> before mm. This is where it stems from.
>>
>
> "sensitive information replaced"
>
> Ok, on client with rsyslog v5 imfile writes to local5 and here is the line
> in local5.log:
>
> Mar 25 13:28:10 hostname apache-access: 123.456.789.000 - -
> [25/Mar/2014:12:40:29 +0100]...
>
> On server with rsyslog v7, his own apache logs with imfile are writen to
> local5 and the line is:
> Apr  4 14:48:51 central apache-access: 111.222.333.444 - -
> [04/Apr/2014:14:48:50 +0200]...
>
> You can see that space is present in both log. But when writing rules and
> templates, somehow the central rsyslog registers a space in msg property
> from this incoming logs but does not take space from msg property when
> reading local logs witch are fetched with imfile...
>
>
That's because the omfile default template inserts the space if it is not
there.


> Btw clients are CentOS 6 and Debian 7 with rsyslog v5 and central rsyslog
> is Centos 6 with rsyslog v7 stable...
>
>
>> In regard to lognorm rules, you can simply duplicate the entries with
>> and without a space in front. It's a bit ugly, but a work-around you can
>> use right now.
>>
>
> If this is normal and is not a bug I allready have two rules and
> templates, one for incoming logs and one for central server local apache
> logs so I have a workaround :)
>

No, it's not a bug, but it's still ugly. I always wanted to add an option to
specify a template to be used for mmnormalize (where you could fix these
things), but it does not play well with the message modification module
interface, so this change would actually be a couple of orders of magnitude
larger than you'd usually expect. Given the mile-long todo list, this
hasn't happened yet and probably will not in the foreseeable future :-(

Rainer

From davor.saric at srce.hr  Fri Apr  4 15:40:29 2014
From: davor.saric at srce.hr (Davor Saric)
Date: Fri, 04 Apr 2014 15:40:29 +0200
Subject: [Lognorm] [rsyslog]  Log normalization and the leading space
In-Reply-To: <CADk+mPA7oW_6CaMp2xspNtRYcuDQyeuDCVsPeCEWqdJ8XUO6aA@mail.gmail.com>
References: <533EA764.2010901@srce.hr>	<CADk+mPD55YWdySjKn3TbXaTDruB26aV33r878Gbb5-TV-jm=DQ@mail.gmail.com>	<533EAEE0.4090700@srce.hr>
	<CADk+mPA7oW_6CaMp2xspNtRYcuDQyeuDCVsPeCEWqdJ8XUO6aA@mail.gmail.com>
Message-ID: <533EB64D.2070707@srce.hr>

On 04.04.2014 15:21, Rainer Gerhards wrote:
> That's because the omfile default template inserts the space if it is not
> there.

Aha, ok.

>
>> Btw clients are CentOS 6 and Debian 7 with rsyslog v5 and central rsyslog
>> is Centos 6 with rsyslog v7 stable...
>>
>>
>>> In regard to lognorm rules, you can simply duplicate the entries with
>>> and without a space in front. It's a bit ugly, but a work-around you can
>>> use right now.
>>>
>>
>> If this is normal and is not a bug I allready have two rules and
>> templates, one for incoming logs and one for central server local apache
>> logs so I have a workaround :)
>>
>
> No, its not a bug, but it's still ugly. I always wanted to add an option to
> specify a template to be used for mmnormalize (where you could fix these
> things), but it does not play well with the message modification module
> interface, and so this change actually would be a couple of magnitudes
> larger than you'd usually expect. Given the mile-long todo list, this
> hasn't happened yet and probably will not in the forseable future :-(
>

Got it. The important part for me is that, once the clients are officially
upgraded to rsyslog v7, I won't have to change the rules and templates in
a running production environment :)

Thanks for the fast response.

With regards,
Davor Saric


From friedl at adiscon.com  Fri Apr 11 16:25:26 2014
From: friedl at adiscon.com (Florian Riedl)
Date: Fri, 11 Apr 2014 16:25:26 +0200
Subject: [Lognorm] liblognorm 1.0.1 released
Message-ID: <CAAq4--QD1ddBDa+b5Yt55+GDoHF0ZErHj-DHqyMVC408SqFXSg@mail.gmail.com>

Hi all,

We have just released liblognorm 1.0.1. This is a pure maintenance release.

Changes

Version 1.0.1, 2014-04-11

   - improved doc (via RST/Sphinx)
   - bugfix: unparsed fields were copied incorrectly from non-terminated
   string. Thanks to Josh Blum for the fix.
   - bugfix: mandatory tag did not work in lognormalizer

Download:
http://www.liblognorm.com/download/liblognorm-1-0-1/

As always, feedback is appreciated.

Best regards,
Florian Riedl

From secsubs at gmail.com  Fri Apr 25 22:50:00 2014
From: secsubs at gmail.com (Xuri Nagarin)
Date: Fri, 25 Apr 2014 13:50:00 -0700
Subject: [Lognorm] regex engine for lognorm
Message-ID: <CADPi3fgk8ufPHAXXECKBuqwwTuom2jzrmYqmkjEsJw_2fTpO3A@mail.gmail.com>

Hi,

I have been looking for a log normalization engine for logs captured
via syslog. The output is to be fed into apps/platforms that like
key/value pairs - Hadoop, Lucene-based search tools and some kind of
stream processor (Storm/Spark).

An easy way to feed rules to this engine would be in this format:

[descriptor/label]
REGEX=someRegexThatExtractsMultipleGroups(1....n)
output=$1:time, $2:host, $3:tag, $4:group4, $5:group5 ........ $n:groupn

You should be able to specify multiple regex rules in this format that
would get evaluated one after the other. Preferably, the engine would
internally rank the regex list in the order of most used to least
used.

Is this something the liblognorm project can take on? Combined with the
capabilities of rsyslog, this would be an enormously scalable and
powerful tool for log analysis because it would allow people to maintain
a single data dictionary across multiple analysis engines like Hadoop,
search and others.

Similar functionality exists in Flume/Logstash via grok but Java
simply sucks when it comes to regex parsing.

Thanks,

- Xuri

From david at lang.hm  Sat Apr 26 09:18:54 2014
From: david at lang.hm (David Lang)
Date: Sat, 26 Apr 2014 00:18:54 -0700 (PDT)
Subject: [Lognorm] regex engine for lognorm
In-Reply-To: <CADPi3fgk8ufPHAXXECKBuqwwTuom2jzrmYqmkjEsJw_2fTpO3A@mail.gmail.com>
References: <CADPi3fgk8ufPHAXXECKBuqwwTuom2jzrmYqmkjEsJw_2fTpO3A@mail.gmail.com>
Message-ID: <alpine.DEB.2.02.1404260016430.23483@nftneq.ynat.uz>

On Fri, 25 Apr 2014, Xuri Nagarin wrote:

> Hi,
>
> I have been looking for a log normalization engine for log captured
> via Syslog. The output is to be fed into apps/platforms that like
> key/value pairs - Hadoop, Lucene based search tools and some kind of
> stream processor (Storm/Spark).
>
> An easy way to feed rules to this engine would be in this format:
>
> [descriptor/label]
> REGEX=someRegexThatExtractsMultipleGroups(1....n)
> output=$1:time, $2:host, $3:tag, $4:group4, $5:group5 ........ $n:groupn
>
> You should be able to specify multiple regex rules in this format that
> would get evaluated one after the other. Preferably, the engine would
> internally rank the regex list in the order of most used to least
> used.
>
> Is this something the libnorm project can take on? Combined with the
> capabilities of rsyslog, this would be an enormously scalable and
> powerful tool for log analysis because will allow people to maintain a
> single data dictionary across multiple analysis engines like Hadoop,
> Search and others.
>
> Similar functionality exists in Flume/Logstash via grok but Java
> simply sucks when it comes to regex parsing.

What liblognorm provides is actually more powerful. It takes the patterns that
you provide (which are similar to regex, but not regex) and then compiles them
into a parse tree for evaluation. This means that the log only needs to be
evaluated once, not once for every rule, and the rule that most closely matches
the log will generate the resulting parsed variables.

Before you throw this away for a regex engine, I would suggest that you 
investigate if it can be used to match your logs as is. If it can, it will be 
FAR faster than any regex engine.

David Lang

From secsubs at gmail.com  Mon Apr 28 21:14:14 2014
From: secsubs at gmail.com (Xuri Nagarin)
Date: Mon, 28 Apr 2014 12:14:14 -0700
Subject: [Lognorm] regex engine for lognorm
In-Reply-To: <alpine.DEB.2.02.1404260016430.23483@nftneq.ynat.uz>
References: <CADPi3fgk8ufPHAXXECKBuqwwTuom2jzrmYqmkjEsJw_2fTpO3A@mail.gmail.com>
	<alpine.DEB.2.02.1404260016430.23483@nftneq.ynat.uz>
Message-ID: <CADPi3fitnp5rmb-EW4DN4aJLK4QBzFKf7rPuFKZ+VmbSTOwVyA@mail.gmail.com>

Hi David,

I am open to any idea that does the job better :)

But I am not sure if my use case is clear and how liblognorm does a
better job? Say, I have 100 log-types (firewall, ssh, proxy, custom
apps, etc etc) in a stream then each log-type needs a unique regex to
normalize the log line. Right?

Now, I have a firehose of logs and I want to be able to feed a stream
off the firehose to an instance of the normalization tier. Each
instance of the normalization tier has all the rules to process the
entire log-type set. This way, as traffic increases, I can almost
infinitely scale horizontally by spinning up a new normalization tier
instance.

So let's say, we have log-types "L" - L1 to L100 and for each
log-type, we have a unique regex to extract the fields and normalize
the line, R1-R100. And, a normalization tier instance is N1. Now, when
L50 hits N1, how does liblognorm avoid going through R1-R100
(actually, it would stop at R50) to find the right match?

Thanks,

Xuri



From david at lang.hm  Mon Apr 28 22:13:47 2014
From: david at lang.hm (David Lang)
Date: Mon, 28 Apr 2014 13:13:47 -0700 (PDT)
Subject: [Lognorm] regex engine for lognorm
In-Reply-To: <CADPi3fitnp5rmb-EW4DN4aJLK4QBzFKf7rPuFKZ+VmbSTOwVyA@mail.gmail.com>
References: <CADPi3fgk8ufPHAXXECKBuqwwTuom2jzrmYqmkjEsJw_2fTpO3A@mail.gmail.com>
	<alpine.DEB.2.02.1404260016430.23483@nftneq.ynat.uz>
	<CADPi3fitnp5rmb-EW4DN4aJLK4QBzFKf7rPuFKZ+VmbSTOwVyA@mail.gmail.com>
Message-ID: <alpine.DEB.2.02.1404281305460.14881@nftneq.ynat.uz>

On Mon, 28 Apr 2014, Xuri Nagarin wrote:

> Hi David,
>
> I am open to any idea that does the job better :)
>
> But I am not sure if my use case is clear and how liblognorm does a
> better job? Say, I have 100 log-types (firewall, ssh, proxy, custom
> apps, etc etc) in a stream then each log-type needs a unique regex to
> normalize the log line. Right?
>
> Now, I have a firehose of logs and I want to be able to feed a stream
> off the firehose to an instance of the normalization tier. Each
> instance of the normalization tier has all the rules to process the
> entire log-type set. This way, as traffic increases, I can almost
> infinitely scale horizontally by spinning up a new normalization tier
> instance.
>
> So let's say, we have log-types "L" - L1 to L100 and for each
> log-type, we have a unique regex to extract the fields and normalize
> the line, R1-R100. And, a normalization tier instance is N1. Now, when
> L50 hits N1, how does liblognorm avoid going through R1-R100
> (actually, it would stop at R50) to find the right match?

liblognorm operates in a completely different way than you are envisioning.

It compiles all the rules into a parse tree, walks that parse tree _once_,
and has the log identified.


As a trivial example, if you have the following 'rules':

1. approximately
2. apart
3. apple

liblognorm would create a tree

ap - art
   \- p - roximately
      \ - le

So when it tries to match apple, it doesn't do three full comparisons. It
starts at the beginning and sees 'ap', then sees that the next character is a
'p' and goes down that branch, then sees that the next character is an 'l' and
goes down that branch, matches the rest and says "this is rule 3".

With this sort of matching, the number of rules has virtually no impact on the
parsing speed; it depends only on the length of what you are matching.
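
The idea can be sketched in a few lines of Python. This is an illustration
of a plain prefix tree over literal text only, not liblognorm's actual data
structure: it stores one character per node instead of merged runs like
'ap', it has no field parsers, and the names Node, build_tree and match are
made up for this example.

```python
class Node:
    """One node of the prefix tree."""

    def __init__(self):
        self.children = {}  # next character -> child Node
        self.rule = None    # rule number if some rule ends exactly here


def build_tree(rules):
    """Compile {rule number: literal text} into a single prefix tree."""
    root = Node()
    for number, text in rules.items():
        node = root
        for ch in text:
            # Reuse the branch if it exists, otherwise create it.
            node = node.children.setdefault(ch, Node())
        node.rule = number
    return root


def match(root, message):
    """Walk the tree once; return the matching rule number or None."""
    node = root
    for ch in message:
        node = node.children.get(ch)
        if node is None:
            return None  # no rule continues with this character
    return node.rule


RULES = {1: "approximately", 2: "apart", 3: "apple"}
TREE = build_tree(RULES)
result = match(TREE, "apple")
```

Each character of the message is looked at once, so the work done by match
grows with the length of the message and not with the number of rules.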

The liblognorm config language is not as powerful as a full regex, but
evaluating a full regex is _very_ expensive. The simplifications that
liblognorm makes eliminate the parts of a regex that are most expensive to
evaluate, but should still be enough to cover your logs. If you find something
you can't match, speak up, and someone can either help you or liblognorm can
be enhanced.

David Lang



From secsubs at gmail.com  Wed Apr 30 00:56:23 2014
From: secsubs at gmail.com (Xuri Nagarin)
Date: Tue, 29 Apr 2014 15:56:23 -0700
Subject: [Lognorm] regex engine for lognorm
In-Reply-To: <alpine.DEB.2.02.1404281305460.14881@nftneq.ynat.uz>
References: <CADPi3fgk8ufPHAXXECKBuqwwTuom2jzrmYqmkjEsJw_2fTpO3A@mail.gmail.com>
	<alpine.DEB.2.02.1404260016430.23483@nftneq.ynat.uz>
	<CADPi3fitnp5rmb-EW4DN4aJLK4QBzFKf7rPuFKZ+VmbSTOwVyA@mail.gmail.com>
	<alpine.DEB.2.02.1404281305460.14881@nftneq.ynat.uz>
Message-ID: <CADPi3fgoVUhdpDGybwF-s5UsAVKppWOjnZ81HUrKOYJ2McqyRg@mail.gmail.com>

On Mon, Apr 28, 2014 at 1:13 PM, David Lang <david at lang.hm> wrote:
> liblognorm has a completely different way of operating than you are
> envisoning.
>
> It compiles all the rules into a parse tree and it walks that parse tree
> _once_ and has the log identified
>
>
> as a trivial example
>
> if you have the following 'rules'
>
> 1. approximately
> 2. apart
> 3. apple
>
> liblognorm would create a tree
>
> ap - art
>   \- p - roximately
>      \ - le
>
> so when it tries to match apple, it's not three full comparisons, it starts
> at the beginning, sees 'ap', then it looks and sees that the next character
> is a 'p' and go down that branch, then see that the next character is a 'l'
> and go down that branch and match the 3 and say "this is rule 3"
>
> With this sort of matching, the number of rules has virtually no impact on
> the parsing speed, it's just the length of what you are matching.

I think I understand the disconnect, or rather my lack of understanding,
here. Re-reading the liblognorm documentation, I see that you are
implementing a subset of the regex language by defining tokens such as
"word", "number" or "ipv4". These are pre-packaged regular expressions.
Supporting a basic set of them lets you avoid creating a full-blown
regex engine and implement a faster parsing mechanism like parse trees.
But I am wondering if this simplification comes at the cost of
flexibility?

Take for example this log line that I want to break up into key/value pairs:
2014-04-29T21:24:42+00:00 hostnameA.abc.com Oracle Audit[31611]:
LENGTH : '172' ACTION :[021] 'select * from products' DATABASE
USER:[3] 'sys' PRIVILEGE :[6] 'SYSDBA' CLIENT USER:[6] 'oracle' CLIENT
TERMINAL:[0] '' STATUS:[1] '0' DBID:[10] '2796591309'

/ACTION :[021] 'select * from products'/ needs to get normalized to
"action=/'select * from products'/"

The action or SQL text can be of varying length and have a varying
amount of whitespace between two keywords. ' select * from products'
is just as valid as 'select       * from    products'. If I am doing
regex, I can use /.+/ followed by the string that is expected to
follow the "action" value. In this case, I would use "DATABASE USER:"
as the boundary where "action" ends. This is easily doable in regex,
but I am not sure how the liblognorm rule language handles it.

>
> The liblognorm config language is not as powerful as a full regex, but
> evaluating a full regex is _very_ expensive to do, the simplifications that
> liblognorm makes eliminate the most expensive things in a regex to evaluate,
> but should still be enough to cover your logs. If you find something you
> can't match, speak up and someone can either help you or liblognorm can be
> enhanced.

Good point. Full regex is expensive. But I'd argue that with the right
design, namely, ability to partition a log stream equally into smaller
streams (something rsyslog does not do today), I can throw hardware at
the problem to scale horizontally, almost infinitely. Hardware is
cheap these days! Also, regex is a more universal language so it is
easier to deploy/maintain from a user's point of view.

Thanks,

- Xuri



From pavel at levshin.spb.ru  Wed Apr 30 07:45:35 2014
From: pavel at levshin.spb.ru (Pavel Levshin)
Date: Wed, 30 Apr 2014 09:45:35 +0400
Subject: [Lognorm] regex engine for lognorm
In-Reply-To: <CADPi3fgoVUhdpDGybwF-s5UsAVKppWOjnZ81HUrKOYJ2McqyRg@mail.gmail.com>
References: <CADPi3fgk8ufPHAXXECKBuqwwTuom2jzrmYqmkjEsJw_2fTpO3A@mail.gmail.com>	<alpine.DEB.2.02.1404260016430.23483@nftneq.ynat.uz>	<CADPi3fitnp5rmb-EW4DN4aJLK4QBzFKf7rPuFKZ+VmbSTOwVyA@mail.gmail.com>	<alpine.DEB.2.02.1404281305460.14881@nftneq.ynat.uz>
	<CADPi3fgoVUhdpDGybwF-s5UsAVKppWOjnZ81HUrKOYJ2McqyRg@mail.gmail.com>
Message-ID: <53608DFF.7080702@levshin.spb.ru>

Hello.

Liblognorm is not a regex engine, nor does it try to be. For the sake of
flexibility, it does implement a form of backtracking while searching the
tree, which is needed to match variable fields to parsers. But it performs
best when backtracking is not used.

This way of doing things is not as flexible as regex, but it is orders 
of magnitude faster than conventional regex. Believe me, I've tested it.

In your particular example, you may match the select clause with the
"char-to" parser.
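
As a rough illustration of what a "read everything up to a given character"
parser does, here is a Python sketch. It only emulates the idea and is not
liblognorm code; the functions char_to and extract_action are made up for
this example. In a rulebase, the corresponding field would presumably be
written along the lines of %action:char-to:'%, by analogy with the
%pver:char-to:"% fields used earlier on this list (untested).

```python
def char_to(text, pos, stop):
    """Return (value, new_pos) for the text from pos up to the stop character.

    Returns None if the stop character does not occur at or after pos.
    """
    end = text.find(stop, pos)
    if end == -1:
        return None
    return text[pos:end], end


def extract_action(line):
    """Pull the quoted SQL text that follows "ACTION :[" out of an audit line."""
    start = line.find("ACTION :[")
    if start == -1:
        return None
    # Skip the bracketed length, e.g. "021] ", by locating the opening quote.
    quote = line.find("'", start)
    if quote == -1:
        return None
    result = char_to(line, quote + 1, "'")
    return result[0] if result else None


LINE = ("LENGTH : '172' ACTION :[021] 'select * from products' "
        "DATABASE USER:[3] 'sys' PRIVILEGE :[6] 'SYSDBA'")
action = extract_action(LINE)
```

Whitespace inside the quotes does not matter, because the field simply runs
to the closing quote. The obvious limitation is SQL text that itself
contains a single quote, which would end the field early.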


--
Pavel Levshin


30.04.2014 2:56, Xuri Nagarin:
> On Mon, Apr 28, 2014 at 1:13 PM, David Lang <david at lang.hm> wrote:
>> liblognorm has a completely different way of operating than you are
>> envisoning.
>>
>> It compiles all the rules into a parse tree and it walks that parse tree
>> _once_ and has the log identified
>>
>>
>> as a trivial example
>>
>> if you have the following 'rules'
>>
>> 1. approximately
>> 2. apart
>> 3. apple
>>
>> liblognorm would create a tree
>>
>> ap - art
>>    \- p - roximately
>>       \ - le
>>
>> so when it tries to match apple, it's not three full comparisons, it starts
>> at the beginning, sees 'ap', then it looks and sees that the next character
>> is a 'p' and go down that branch, then see that the next character is a 'l'
>> and go down that branch and match the 3 and say "this is rule 3"
>>
>> With this sort of matching, the number of rules has virtually no impact on
>> the parsing speed, it's just the length of what you are matching.
> I think I understand the disconnect or lack of my understanding here.
> Re-reading the liblognorm documentation, I see that you are
> implementing a subset of the regex language by defining tokens as
> "word", "number" or "ipv4". These are pre-packaged regex expressions.
> Supporting a basic set of regex allows you to avoid creating a full
> blown regex engine and lets you implement a faster parsing mechanism
> like parse trees. But I am wondering if this simplification comes at a
> cost of flexibility?
>
> Take for example this log line that I want to break up into key/value pairs:
> 2014-04-29T21:24:42+00:00 hostnameA.abc.com Oracle Audit[31611]:
> LENGTH : '172' ACTION :[021] 'select * from products' DATABASE
> USER:[3] 'sys' PRIVILEGE :[6] 'SYSDBA' CLIENT USER:[6] 'oracle' CLIENT
> TERMINAL:[0] '' STATUS:[1] '0' DBID:[10] '2796591309'
>
> /ACTION :[021] 'select * from products'/ needs to get normalized to
> "action=/'select * from products'/"
>
> The action or sql text can be of varying length and have varying
> number of whitespaces between two keywords. ' select * from products'
> is just as valid as 'select       * from    products'. If I am doing
> regex, I can use /.+/ followed by the string that is expected to
> succeed the "action" value. In this case, use "DATABASE USER:" as a
> boundary where "action" ends. Of course, this is easily doable in
> regex but not sure of how liblognorm rule language handles it.
>
>