[Lognorm] regex engine for lognorm

Mon Apr 28 22:13:47 CEST 2014

On Mon, 28 Apr 2014, Xuri Nagarin wrote:

> Hi David,
>
> I am open to any idea that does the job better :)
>
> But I am not sure if my use case is clear and how liblognorm does a
> better job? Say, I have 100 log-types (firewall, ssh, proxy, custom
> apps, etc etc) in a stream then each log-type needs a unique regex to
> normalize the log line. Right?
>
> Now, I have a firehose of logs and I want to be able to feed a stream
> off the firehose to an instance of the normalization tier. Each
> instance of the normalization tier has all the rules to process the
> entire log-type set. This way, as traffic increases, I can almost
> infinitely scale horizontally by spinning up a new normalization tier
> instance.
>
> So let's say, we have log-types "L" - L1 to L100 and for each
> log-type, we have a unique regex to extract the fields and normalize
> the line, R1-R100. And, a normalization tier instance is N1. Now, when
> L50 hits N1, how does liblognorm avoid going through R1-R100
> (actually, it would stop at R50) to find the right match?

liblognorm has a completely different way of operating than you are envisioning.

It compiles all the rules into a parse tree, walks that parse tree _once_, and 
has the log identified.

As a trivial example, say you have the following 'rules':

1. approximately
2. apart
3. apple

liblognorm would create a tree

ap - art
   \- p - roximately
      \- le

So when it tries to match 'apple', it does not do three full comparisons. It 
starts at the beginning and sees 'ap', then sees that the next character is a 
'p' and goes down that branch, then sees that the next character is an 'l' and 
goes down that branch, matches the remaining 'le', and says "this is rule 3".
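
A minimal sketch of that walk in Python, assuming a plain character-level 
prefix tree. This is only an illustration of the idea, not liblognorm's 
implementation: the real parse tree also has to handle variable fields, and the 
names build_tree and match are made up for this example.

```python
# Illustrative sketch only: a character-level prefix tree built from the
# three example 'rules'. build_tree and match are hypothetical names.

def build_tree(rules):
    """Merge all rule strings into one nested-dict prefix tree."""
    root = {}
    for rule_id, text in rules.items():
        node = root
        for ch in text:
            # Reuse the branch if it already exists, otherwise create it.
            node = node.setdefault(ch, {})
        # "rule" is a multi-character key, so it cannot clash with a
        # single-character branch key.
        node["rule"] = rule_id
    return root


def match(tree, message):
    """Walk the tree once; return (rule id or None, characters examined)."""
    node = tree
    steps = 0
    for ch in message:
        if ch not in node:
            return None, steps  # no rule shares this prefix
        node = node[ch]
        steps += 1
    return node.get("rule"), steps


rules = {1: "approximately", 2: "apart", 3: "apple"}
tree = build_tree(rules)
print(match(tree, "apple"))    # (3, 5)
print(match(tree, "apricot"))  # (None, 2)
```

The second value returned is the number of characters examined. It depends on 
the message, not on how many rules were merged into the tree.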

With this sort of matching, the number of rules has virtually no impact on the 
parsing speed; what matters is the length of what you are matching.

The liblognorm config language is not as powerful as a full regex, but 
evaluating a full regex is _very_ expensive. The simplifications that 
liblognorm makes eliminate the most expensive things in a regex to evaluate, 
but what is left should still be enough to cover your logs. If you find 
something you can't match, speak up and someone can either help you or 
liblognorm can be enhanced.

David Lang

> Thanks,
>
> Xuri
>
>
>
>
> On Sat, Apr 26, 2014 at 12:18 AM, David Lang <david at lang.hm> wrote:
>> On Fri, 25 Apr 2014, Xuri Nagarin wrote:
>>
>>> Hi,
>>>
>>> I have been looking for a log normalization engine for log captured
>>> via Syslog. The output is to be fed into apps/platforms that like
>>> key/value pairs - Hadoop, Lucene based search tools and some kind of
>>> stream processor (Storm/Spark).
>>>
>>> An easy way to feed rules to this engine would be in this format:
>>>
>>> [descriptor/label]
>>> REGEX=someRegexThatExtractsMultipleGroups(1....n)
>>> output=$1:time, $2:host, $3:tag, $4:group4, $5:group5 ........ $n:groupn
>>>
>>> You should be able to specify multiple regex rules in this format that
>>> would get evaluated one after the other. Preferably, the engine would
>>> internally rank the regex list in the order of most used to least
>>> used.
>>>
>>> Is this something the libnorm project can take on? Combined with the
>>> capabilities of rsyslog, this would be an enormously scalable and
>>> powerful tool for log analysis because will allow people to maintain a
>>> single data dictionary across multiple analysis engines like Hadoop,
>>> Search and others.
>>>
>>> Similar functionality exists in Flume/Logstash via grok but Java
>>> simply sucks when it comes to regex parsing.
>>
>>
>> What liblognorm provides is actually more powerful. It takes the patterns
>> that you provide (which are similar to regex, but not regex) and then
>> compiles them into a parse tree for evaluation. This means that the log only
>> needs to be evaluated once, not once for every rule, and the rule that most
>> closely matches the log will generate the resulting parsed variables.
>>
>> Before you throw this away for a regex engine, I would suggest that you
>> investigate if it can be used to match your logs as is. If it can, it will
>> be FAR faster than any regex engine.
>>
>> David Lang
>> _______________________________________________
>> Lognorm mailing list
>> Lognorm at lists.adiscon.com
>> http://lists.adiscon.net/mailman/listinfo/lognorm