[rsyslog] Log Normalization effort
david at lang.hm
Fri Feb 26 22:32:56 CET 2010
On Fri, 26 Feb 2010, Rainer Gerhards wrote:
> Hi all,
>
> I have blogged about my quest for log normalization. I think there is some
> good information on the upcoming GPLed Adiscon LogAnalyzer and future
> directions for rsyslog in the blog post. So I thought I share the link:
>
> http://blog.gerhards.net/2010/02/syslog-normalization.html
>
> Please note that part of the effort requires community involvement. I would
> be very interested to learn if you think we could win enough support to make
> this a useful effort. I am asking for your feedback, because it will help me
> streamline my priorities for future rsyslog work.
A few comments (but remember that I usually deal with high data rates,
so my concerns are biased in that direction):
Log analysis is usually done in batches rather than in real time.
Some of this is due to the difficulty of doing it in real time, but a lot
of it is the processing overhead (you don't want to take so long to
process an individual request that you miss the next one to arrive).
At low volumes the idea of name-value pairs in the logs makes a lot of
sense, but there is significantly more overhead in parsing a log with
name-value pairs in arbitrary orders than there is in using a tree parsing
approach to analyze known log formats in a fixed order. The message size
can also increase significantly. As a result, at high traffic volumes this
starts to be a bad (or at least questionable) idea.
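To make the trade-off concrete, here is a minimal sketch of the two styles. The message layouts, field names, and both function names are assumptions made up for illustration; they are not rsyslog formats or APIs.

```python
def parse_name_value(msg):
    """Parse 'name=value' tokens that may appear in any order."""
    fields = {}
    for token in msg.split():
        name, sep, value = token.partition("=")
        if sep:  # keep only tokens that really contain '='
            fields[name] = value
    return fields


def parse_fixed(msg):
    """Parse a known format whose fields always come in the same order."""
    user, src, result = msg.split(" ", 2)
    return {"user": user, "src": src, "result": result}


# The same (hypothetical) login event in both encodings.
nv_msg = "result=ok user=alice src=10.0.0.5"
fixed_msg = "alice 10.0.0.5 ok"
```

Both calls produce the same dictionary, but the name-value message carries the field names in every record and has to be scanned token by token, while the fixed-order parser relies on position alone.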
I would love to see rsyslog gain the ability to do tree-based parsing
efficiently instead of regex parsing. Regex parsing is easy to understand
and tinker with, but very expensive to execute. It may be that having
something that 'compiles' a list of regex parsers into a tree parser is
the right answer for usability. I would save several hours of processing
a day if I could easily (and efficiently) make rsyslog write different
logs to different files (at high data rates and with a few hundred
condition-based variations in the syslog tag).
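As a rough illustration of the 'compile a rule list into a tree' idea, the sketch below builds a character trie from literal tag prefixes and routes a tag with a single walk, whatever the number of rules. Literal prefixes stand in for regexes here (compiling real regexes is a harder problem), and the rule list, file paths, and function names are all hypothetical.

```python
def compile_rules(rules):
    """Build a character trie from (tag_prefix, destination) pairs."""
    root = {}
    for prefix, dest in rules:
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
        node[None] = dest  # the None key marks the end of a rule
    return root


def route(tree, tag, default="/var/log/other.log"):
    """Walk the trie once and return the longest matching prefix's destination."""
    node = tree
    best = node.get(None, default)
    for ch in tag:
        node = node.get(ch)
        if node is None:
            break  # no rule continues with this character
        best = node.get(None, best)
    return best


# Hypothetical rules: syslog tag prefix -> output file.
RULES = [
    ("sendmail", "/var/log/mail.log"),
    ("sshd", "/var/log/auth.log"),
    ("ssh", "/var/log/ssh-misc.log"),
]
TREE = compile_rules(RULES)
```

With these rules, `route(TREE, "sshd[1234]")` picks the longer `sshd` rule over `ssh`, and an unknown tag falls through to the default file. The cost of a lookup depends on the length of the tag, not on how many rules were compiled in.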
While there are some common events across different types of logs (logins
for example) they almost always contain slightly different data in them. I
also have no faith at all that anyone is going to make much effort to
clean up their logs to make them nicely parseable, and if they do I see
even less chance that they will end up using the same terms for the same
thing. As such I see more value in trying to get samples of logs and what
they mean than in trying to define a normalized version to shoehorn the
logs into. It is worth doing this for some events (logins, failed logins
for example), but I think it's a mistake to think that this will end up
covering all, or even the majority of log messages.
There's also a problem in that the ideal format for the output depends on
what you are doing with the output.
If I could wave a magic wand and get the result, I would look for
something like this:
The parser starts at the beginning of the message (at the priority) and
can branch on priority/facility, timestamp, host, syslogtag, and message.
It can indicate whether the message should be parsed into name-value
pairs or split on a character (or a character sequence, like the perl
split command allows) into individually addressable elements (defaulting
to whitespace-separated elements). The format (and, if needed, dynafile
path/file components) could then be constructed from these variables. At
any point in the parsing it should be possible to jump to another parser
tree (so that you could say that sm-mta, sendmail, Sendmail, etc. as
syslog tags all end up using the same parser for the message without
having to redefine the rules for each one).
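A very small sketch of that wish list, under stated assumptions: the branch definitions, tag table, path templates, and the `parse` function are invented for illustration and do not correspond to any existing rsyslog configuration syntax. It only branches on the syslog tag, lets several tags jump to one shared branch, and builds a dynafile-style path from the parsed variables.

```python
# Hypothetical parser branches: how to split the message and how to
# build the output path from the available variables.
PARSERS = {
    "mail": {"mode": "split", "sep": None, "path": "/var/log/mail/{host}.log"},
    "kv": {"mode": "pairs", "path": "/var/log/apps/{tag}.log"},
}

# Several syslog tags jump to the same branch, so the rules are written once.
TAG_TO_PARSER = {
    "sm-mta": "mail",
    "sendmail": "mail",
    "Sendmail": "mail",
    "myapp": "kv",
}


def parse(host, tag, message):
    """Return (output_path, parsed_elements), or None for an unknown tag."""
    name = TAG_TO_PARSER.get(tag)
    if name is None:
        return None
    branch = PARSERS[name]
    if branch["mode"] == "pairs":
        # name-value pairs: keep tokens that contain '='
        fields = dict(t.split("=", 1) for t in message.split() if "=" in t)
    else:
        # split mode: a separator of None means "split on whitespace"
        fields = message.split(branch["sep"])
    path = branch["path"].format(host=host, tag=tag)
    return path, fields
```

For example, `parse("mx1", "sm-mta", "AB12 to=<a@b> stat=Sent")` and the same call with the tag `Sendmail` both land in `/var/log/mail/mx1.log`, with the message available as individually addressable elements.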
With this capability, people could start writing parser 'branches' to
understand a specific log type and output a 'standard' format (as such a
format can start to be defined).
This can be done in rsyslog today, but it is fairly difficult to define,
and as I understand it, inefficient enough that it's not practical to do
in real-time under heavy load.
If this is fast enough, then the next step would be to add the ability to
have the format/action be 'increment a counter for log type X', and a
signal to rsyslog could generate a report on these counters. At some
point, though, it becomes better to feed the message into another open
source tool (SEC, the Simple Event Correlator, for example) instead of
trying to do everything in rsyslog.
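The counter idea could look roughly like the sketch below. It is a standalone Python illustration, not rsyslog code; the choice of SIGUSR1, the report layout, and every function name are assumptions.

```python
import collections
import signal

# One counter per log type, incremented by the (hypothetical) action.
counters = collections.Counter()


def count(log_type):
    """Action: increment the counter for this log type."""
    counters[log_type] += 1


def report():
    """Format the counters as 'type count' lines, sorted by type name."""
    return "\n".join(f"{name} {n}" for name, n in sorted(counters.items()))


def on_signal(signum, frame):
    """Signal handler: print the current counters."""
    print(report())


def install_handler():
    """Print a report on SIGUSR1 (assumed signal; not available on all platforms)."""
    if hasattr(signal, "SIGUSR1"):
        signal.signal(signal.SIGUSR1, on_signal)
```

After calling `install_handler()` in the main thread, sending the process SIGUSR1 (for example with `kill -USR1 <pid>`) prints the report without interrupting normal processing.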
Parsing the file to know what to do with it, and being able to re-format
log messages, is very definitely something that can fit into the rsyslog
model of receiving, formatting, and delivering logs. Alerting on specific
log entries, counting the number of times one thing shows up in the logs,
and that sort of thing start pushing beyond the core of rsyslog, and it
may be better to feed other tools instead.
David Lang