[Lognorm] regex engine for lognorm

Wed Apr 30 00:56:23 CEST 2014

On Mon, Apr 28, 2014 at 1:13 PM, David Lang <david at lang.hm> wrote:
> liblognorm has a completely different way of operating than you are
> envisioning.
>
> It compiles all the rules into a parse tree, walks that parse tree
> _once_, and has the log identified.
>
>
> as a trivial example
>
> if you have the following 'rules'
>
> 1. approximately
> 2. apart
> 3. apple
>
> liblognorm would create a tree
>
> ap - art
>   \- p - roximately
>      \ - le
>
> so when it tries to match apple, it does not do three full comparisons. It
> starts at the beginning, sees 'ap', then sees that the next character is a
> 'p' and goes down that branch, then sees that the next character is an 'l',
> goes down that branch, reaches the end of rule 3 and says "this is rule 3"
>
> With this sort of matching, the number of rules has virtually no impact on
> the parsing speed; what matters is the length of what you are matching.
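
The prefix-tree walk described above can be sketched in a few lines of
Python. This is only an illustration of the idea, not liblognorm's actual
data structure or API: the names build_trie and match are hypothetical, the
tree is one character per node rather than compressed, and it handles
literal strings only (no field parsers such as "word" or "ipv4").

```python
def build_trie(rules):
    """Build a character trie from {rule_id: literal_text} (illustrative only)."""
    root = {}
    for rule_id, text in rules.items():
        node = root
        for ch in text:
            # Rules that share a prefix share the same nodes.
            node = node.setdefault(ch, {})
        # Multi-character key, so it cannot collide with a single-character edge.
        node["$rule"] = rule_id
    return root


def match(trie, message):
    """Walk the trie once. Returns (rule_id or None, characters consumed)."""
    node = trie
    steps = 0
    for ch in message:
        if ch not in node:
            return None, steps  # no rule continues with this character
        node = node[ch]
        steps += 1
    return node.get("$rule"), steps


trie = build_trie({1: "approximately", 2: "apart", 3: "apple"})
print(match(trie, "apple"))
```

The work done by match is bounded by the length of the message, not by the
number of rules, which is the property being claimed in the quoted text.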

I think I see where the disconnect, or the gap in my understanding, is.
Re-reading the liblognorm documentation, I see that you are
implementing a subset of the regex language by defining tokens such as
"word", "number" or "ipv4". These are pre-packaged regular expressions.
Supporting only a basic set of patterns lets you avoid building a
full-blown regex engine and lets you use a faster parsing mechanism
such as a parse tree. But I am wondering whether this simplification
comes at the cost of flexibility.

Take for example this log line that I want to break up into key/value pairs:
2014-04-29T21:24:42+00:00 hostnameA.abc.com Oracle Audit[31611]:
LENGTH : '172' ACTION :[021] 'select * from products' DATABASE
USER:[3] 'sys' PRIVILEGE :[6] 'SYSDBA' CLIENT USER:[6] 'oracle' CLIENT
TERMINAL:[0] '' STATUS:[1] '0' DBID:[10] '2796591309'

/ACTION :[021] 'select * from products'/ needs to get normalized to
"action=/'select * from products'/"

The action, or SQL text, can be of varying length and can have a varying
amount of whitespace between two keywords. ' select * from products'
is just as valid as 'select       * from    products'. With a regex,
I can use /.+/ followed by the string that is expected to come right
after the "action" value. In this case, I would use "DATABASE USER:" as
the boundary where "action" ends. This is easy to do in a regex, but I
am not sure how the liblognorm rule language handles it.
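
To make the regex side of this concrete, here is a minimal Python sketch of
the boundary approach. The pattern and the names ACTION_RE and
extract_action are my own illustration, not part of liblognorm or rsyslog.
It assumes the log line arrives as a single line, with "DATABASE USER:"
separated only by whitespace.

```python
import re

# The sample line from above, joined back into one line.
LINE = (
    "2014-04-29T21:24:42+00:00 hostnameA.abc.com Oracle Audit[31611]: "
    "LENGTH : '172' ACTION :[021] 'select * from products' DATABASE "
    "USER:[3] 'sys' PRIVILEGE :[6] 'SYSDBA' CLIENT USER:[6] 'oracle' CLIENT "
    "TERMINAL:[0] '' STATUS:[1] '0' DBID:[10] '2796591309'"
)

# Non-greedy .+? grows until the closing quote that is followed by the
# "DATABASE USER:" boundary, so quotes inside the SQL text do not end the match.
ACTION_RE = re.compile(
    r"ACTION\s*:\[\d+\]\s*'(?P<action>.+?)'\s+DATABASE\s+USER:"
)


def extract_action(line):
    """Return the raw SQL text of the ACTION field, or None if not found."""
    m = ACTION_RE.search(line)
    return m.group("action") if m else None


print("action=" + repr(extract_action(LINE)))
```

The whitespace inside the SQL text is returned untouched; collapsing it
would be a separate normalization step.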

>
> The liblognorm config language is not as powerful as a full regex, but
> evaluating a full regex is _very_ expensive to do, the simplifications that
> liblognorm makes eliminate the most expensive things in a regex to evaluate,
> but should still be enough to cover your logs. If you find something you
> can't match, speak up and someone can either help you or liblognorm can be
> enhanced.

Good point. Full regex is expensive. But I'd argue that with the right
design, namely the ability to partition a log stream evenly into smaller
streams (something rsyslog does not do today), I can throw hardware at
the problem and scale horizontally. Hardware is cheap these days! Also,
regex is a more widely known language, so it is easier to deploy and
maintain from a user's point of view.

Thanks,

- Xuri

>
> David Lang
>
>
>
>> Thanks,
>>
>> Xuri
>>
>>
>>
>>
>> On Sat, Apr 26, 2014 at 12:18 AM, David Lang <david at lang.hm> wrote:
>>>
>>> On Fri, 25 Apr 2014, Xuri Nagarin wrote:
>>>
>>>> Hi,
>>>>
>>>> I have been looking for a log normalization engine for log captured
>>>> via Syslog. The output is to be fed into apps/platforms that like
>>>> key/value pairs - Hadoop, Lucene based search tools and some kind of
>>>> stream processor (Storm/Spark).
>>>>
>>>> An easy way to feed rules to this engine would be in this format:
>>>>
>>>> [descriptor/label]
>>>> REGEX=someRegexThatExtractsMultipleGroups(1....n)
>>>> output=$1:time, $2:host, $3:tag, $4:group4, $5:group5 ........ $n:groupn
>>>>
>>>> You should be able to specify multiple regex rules in this format that
>>>> would get evaluated one after the other. Preferably, the engine would
>>>> internally rank the regex list in the order of most used to least
>>>> used.
>>>>
>>>> Is this something the liblognorm project can take on? Combined with the
>>>> capabilities of rsyslog, this would be an enormously scalable and
>>>> powerful tool for log analysis because it will allow people to maintain a
>>>> single data dictionary across multiple analysis engines like Hadoop,
>>>> Search and others.
>>>>
>>>> Similar functionality exists in Flume/Logstash via grok but Java
>>>> simply sucks when it comes to regex parsing.
>>>
>>>
>>>
>>> What liblognorm provides is actually more powerful. It takes the patterns
>>> that you provide (which are similar to regex, but not regex) and then
>>> compiles them into a parse tree for evaluation. This means that the log
>>> only needs to be evaluated once, not once for every rule, and the rule
>>> that most closely matches the log will generate the resulting parsed
>>> variables.
>>>
>>> Before you throw this away for a regex engine, I would suggest that you
>>> investigate if it can be used to match your logs as is. If it can, it
>>> will be FAR faster than any regex engine.
>>>
>>> David Lang
>>> _______________________________________________
>>> Lognorm mailing list
>>> Lognorm at lists.adiscon.com
>>> http://lists.adiscon.net/mailman/listinfo/lognorm
>>