[Lognorm] regex engine for lognorm

Mon Apr 28 21:14:14 CEST 2014

Hi David,

I am open to any idea that does the job better :)

But I am not sure my use case is clear, or how liblognorm does a
better job. Say I have 100 log-types (firewall, ssh, proxy, custom
apps, etc.) in a stream; then each log-type needs a unique regex to
normalize the log line, right?

Now, I have a firehose of logs and I want to be able to feed a stream
off the firehose to an instance of the normalization tier. Each
instance of the normalization tier has all the rules to process the
entire log-type set. This way, as traffic increases, I can almost
infinitely scale horizontally by spinning up a new normalization tier
instance.

So let's say we have log-types L1 to L100 and, for each log-type, a
unique regex (R1 to R100) that extracts the fields and normalizes the
line. A normalization tier instance is N1. Now, when L50 hits N1, how
does liblognorm avoid going through R1-R100 (actually, it would stop
at R50) to find the right match?
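
To make the question concrete, here is a minimal Python sketch of the
two approaches as I understand them. Everything in it is an assumption
for illustration: the three rules, the log lines, and the names
(RULES, TOKEN_RULES, normalize_linear, normalize_tree) are made up, the
"<field>" pattern syntax is not liblognorm's rulebase syntax, and the
word-level tree is only a simplified stand-in for a parse tree, not
liblognorm's actual implementation.

```python
import re

# Hypothetical regex rules, one per log-type (stand-ins for R1..R100).
RULES = [
    ("sshd_login", re.compile(r"sshd: Accepted password for (?P<user>\S+) from (?P<ip>\S+)$")),
    ("fw_drop", re.compile(r"kernel: DROP from (?P<src>\S+) to (?P<dst>\S+)$")),
    ("proxy_get", re.compile(r"squid: GET (?P<url>\S+) (?P<status>\d+)$")),
]

def normalize_linear(line):
    """Try each regex in order; return (label, fields, regexes tried)."""
    for attempts, (label, rx) in enumerate(RULES, start=1):
        m = rx.match(line)
        if m:
            return label, m.groupdict(), attempts
    return None, {}, len(RULES)

# The same rules as whitespace-separated tokens; "<name>" captures one word.
TOKEN_RULES = {
    "sshd_login": "sshd: Accepted password for <user> from <ip>",
    "fw_drop": "kernel: DROP from <src> to <dst>",
    "proxy_get": "squid: GET <url> <status>",
}

class Node:
    def __init__(self):
        self.literals = {}  # literal word -> child Node
        self.field = None   # (field name, child Node), at most one per node here
        self.label = None   # set on the node where a rule ends

def build_tree(rules):
    """Merge all rules into one tree so shared prefixes are stored once."""
    root = Node()
    for label, pattern in rules.items():
        node = root
        for tok in pattern.split():
            if tok.startswith("<") and tok.endswith(">"):
                if node.field is None:
                    node.field = (tok[1:-1], Node())
                node = node.field[1]
            else:
                node = node.literals.setdefault(tok, Node())
        node.label = label
    return root

def normalize_tree(root, line):
    """Walk the line once; each word picks one edge, whatever the rule count."""
    node, fields = root, {}
    for word in line.split():
        if word in node.literals:      # a literal edge wins over a field edge
            node = node.literals[word]
        elif node.field is not None:   # otherwise capture the word as a field
            name, child = node.field
            fields[name] = word
            node = child
        else:
            return None, {}            # no rule continues with this word
    if node.label is None:
        return None, {}                # line ended in the middle of a rule
    return node.label, fields

TREE = build_tree(TOKEN_RULES)
```

With the linear list, normalize_linear("squid: GET http://example.com/ 200")
has to try all three regexes before it matches, and that count grows
with the number of rules. normalize_tree(TREE, ...) on the same line
inspects each word once, because the first word already selects the
"squid:" branch. If that is roughly what liblognorm does, then I see
why L50 would not need to pass through R1-R49.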

Thanks,

Xuri

On Sat, Apr 26, 2014 at 12:18 AM, David Lang <david at lang.hm> wrote:
> On Fri, 25 Apr 2014, Xuri Nagarin wrote:
>
>> Hi,
>>
>> I have been looking for a log normalization engine for log captured
>> via Syslog. The output is to be fed into apps/platforms that like
>> key/value pairs - Hadoop, Lucene based search tools and some kind of
>> stream processor (Storm/Spark).
>>
>> An easy way to feed rules to this engine would be in this format:
>>
>> [descriptor/label]
>> REGEX=someRegexThatExtractsMultipleGroups(1....n)
>> output=$1:time, $2:host, $3:tag, $4:group4, $5:group5 ........ $n:groupn
>>
>> You should be able to specify multiple regex rules in this format that
>> would get evaluated one after the other. Preferably, the engine would
>> internally rank the regex list in the order of most used to least
>> used.
>>
>> Is this something the liblognorm project can take on? Combined with
>> the capabilities of rsyslog, this would be an enormously scalable and
>> powerful tool for log analysis because it will allow people to maintain
>> a single data dictionary across multiple analysis engines like Hadoop,
>> Search and others.
>>
>> Similar functionality exists in Flume/Logstash via grok but Java
>> simply sucks when it comes to regex parsing.
>
>
> What liblognorm provides is actually more powerful. It takes the patterns
> that you provide (which are similar to regex, but not regex) and then
> compiles them into a parse tree for evaluation. This means that the log only
> needs to be evaluated once, not once for every rule, and the rule that most
> closely matches the log will generate the resulting parsed variables.
>
> Before you throw this away for a regex engine, I would suggest that you
> investigate if it can be used to match your logs as is. If it can, it will
> be FAR faster than any regex engine.
>
> David Lang
> _______________________________________________
> Lognorm mailing list
> Lognorm at lists.adiscon.com
> http://lists.adiscon.net/mailman/listinfo/lognorm