[Lognorm] regex engine for lognorm

Sat Apr 26 09:18:54 CEST 2014

On Fri, 25 Apr 2014, Xuri Nagarin wrote:

> Hi,
>
> I have been looking for a log normalization engine for logs captured
> via Syslog. The output is to be fed into apps/platforms that like
> key/value pairs - Hadoop, Lucene-based search tools and some kind of
> stream processor (Storm/Spark).
>
> An easy way to feed rules to this engine would be in this format:
>
> [descriptor/label]
> REGEX=someRegexThatExtractsMultipleGroups(1....n)
> output=$1:time, $2:host, $3:tag, $4:group4, $5:group5 ........ $n:groupn
>
> You should be able to specify multiple regex rules in this format that
> would get evaluated one after the other. Preferably, the engine would
> internally rank the regex list in the order of most used to least
> used.
>
> Is this something the liblognorm project can take on? Combined with the
> capabilities of rsyslog, this would be an enormously scalable and
> powerful tool for log analysis because it will allow people to maintain a
> single data dictionary across multiple analysis engines like Hadoop,
> Search and others.
>
> Similar functionality exists in Flume/Logstash via grok but Java
> simply sucks when it comes to regex parsing.

What liblognorm provides is actually more powerful. It takes the patterns that
you provide (which are similar to regexes, but are not regexes) and compiles
them into a parse tree for evaluation. This means that each log message only
needs to be evaluated once, not once for every rule, and the rule that most
closely matches the message generates the resulting parsed variables.
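
To make the parse tree idea concrete, here is a rough sketch in Python. It is
only a toy under my own assumptions: the Node class, the add_rule/parse
functions, the ("lit", ...)/("field", ...) token format and the sample
messages are all made up for illustration and are not liblognorm's code or its
rulebase syntax. It only shows how rules with a common prefix can share one
path through a tree, so the message is walked through the tree rather than
tested against every rule in turn.

```python
# Toy illustration only: NOT liblognorm's implementation or rule syntax.
class Node:
    def __init__(self):
        self.literals = {}   # literal text -> child Node
        self.fields = {}     # field name -> child Node (captures one word)
        self.label = None    # set when a rule ends at this node


def add_rule(root, label, tokens):
    """Merge one rule into the shared tree; identical prefixes are stored once."""
    node = root
    for kind, value in tokens:
        table = node.literals if kind == "lit" else node.fields
        node = table.setdefault(value, Node())
    node.label = label


def parse(node, text, pos=0):
    """Walk the tree along the message; return (label, fields) or None."""
    if pos == len(text):
        return (node.label, {}) if node.label is not None else None
    # Literal branches are tried before field branches.
    for lit, child in node.literals.items():
        if text.startswith(lit, pos):
            result = parse(child, text, pos + len(lit))
            if result is not None:
                return result
    for name, child in node.fields.items():
        end = text.find(" ", pos)
        if end == -1:
            end = len(text)
        if end == pos:
            continue  # a field must capture at least one character
        result = parse(child, text, end)
        if result is not None:
            label, fields = result
            return label, {name: text[pos:end], **fields}
    return None


# Hypothetical rules; the second and third share their first four tokens.
login = [("field", "user"), ("lit", " from "), ("field", "src")]
root = Node()
add_rule(root, "login_ok", [("lit", "Accepted password for ")] + login)
add_rule(root, "login_failed", [("lit", "Failed password for ")] + login)
add_rule(root, "login_failed_port",
         [("lit", "Failed password for ")] + login
         + [("lit", " port "), ("field", "port")])

print(parse(root, "Failed password for bob from 10.0.0.5 port 22"))
```

Adding a rule only adds the branches that are not already in the tree, so the
cost of parsing a message depends mostly on the length of the message and not
on the number of rules.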

Before you throw this away for a regex engine, I would suggest that you
investigate whether it can be used to match your logs as is. If it can, it will
be FAR faster than any regex engine.

David Lang