From pavel at levshin.spb.ru Wed Oct 30 20:51:17 2013
From: pavel at levshin.spb.ru (Pavel Levshin)
Date: Wed, 30 Oct 2013 23:51:17 +0400
Subject: [Lognorm] [rsyslog] liblognorm
In-Reply-To:
References: <526E77CC.6040904@levshin.spb.ru>
Message-ID: <52716335.9090505@levshin.spb.ru>

So, I have taken the opportunity and refactored liblognorm to use json-c
instead of libee. Some parts of libee are now present in liblognorm, notably
the field parsers and encoders. They were rewritten to get rid of libee data
structures. At the same time, many bugs were fixed, and many new ones were
undoubtedly introduced.

The current state of the library can be seen here:

https://github.com/flicker581/liblognorm/tree/master-json-c

It is a work in progress, though. Lognormalizer works fine, but mmnormalize
has not been updated yet. The new version is somewhat slower than older
versions used to be; in my tests it was ~40% slower. This slowdown is
attributable to more complex memory management due to bigger allocations by
json-c. Still, it should be much faster than mmnormalize with older
liblognorm.

Comments are very welcome.

>> * b) there is no terminator until the end of the buffer
>>
> same problem. The broader the simple parsers are, the higher the chances
> for false positives or much more backtracking (in the end-of-line case it's
> just false positive). The core idea is to use (lots of) very special
> parsers, and resort to generic ones only if there is no way around that.

The char-to parser stops at a certain character. Therefore, the only way to
match is to have this character after the field. If the literal character
follows the field, it is safe to allow the field to be "empty", I think. It
should not even break any existing meaningful rules.

>> Both break CSV parsing.
>>
> Isn't there a CSV parser already?

No, there is not. But it is just an example.

--
Pavel Levshin

28.10.2013 20:01, Rainer Gerhards:
> On Mon, Oct 28, 2013 at 3:42 PM, Pavel Levshin wrote:
>
>> Hello.
>>
>> Is it OK to discuss liblognorm here?
>>
> I think it's fine, but it may be a good idea to CC the lognorm list, there
> may be one or two folks over there ;)
>
>> This approach to log parsing seems attractive to me. In its current
>> state, though, it is not very usable for high load, and if there is no
>> high load, then one can use regexps to do the same. So I would like to
>> extend the idea to something fitting our purposes, instead of writing a
>> custom parsing module.
>>
>> There are a few shortcomings now:
>>
>> 1. Liblognorm is using libee for parsing and event handling, then the
>> event gets converted to json and imported to json-c structures. This is
>> said to be inefficient. I'll do my own tests of how inefficient it is in
>> reality. Then, what is the preferred way of overcoming it? Liblognorm
>> could be extended to support json-c natively, or it could present some
>> callback interface to populate fields in mmnormalize. It is questionable
>> if we should continue to use libee, then. Or libee could be rewritten to
>> use json-c, maybe...
>>
>
> That would probably require a much longer answer, but let me at least go
> for a quick one. There is a lot of legacy with liblognorm and libee. libee
> was meant to become the reference lib for Mitre's CEE effort, before it
> began to hibernate (to phrase it politely). Even worse, libee is written to
> a much older spec, and is very bloated in many of its objects. The
> long-term approach should probably be to get rid of libee altogether, but
> there are some other apps that depend on it, so we must be somewhat
> careful (Champ, any comments?). BTW: the same is true for libestr, which
> was meant to be used as a common string-lib for CEE, as CEE initially
> thought they would desperately need to support embedded NUL chars,
> something that was later dropped (but still is part of libestr).
>
> Finally, json-c is probably even an interim solution for rsyslog. It is
> quite generic, which also boils down to it being slower and more memory
> hungry than absolutely necessary. There has been thinking about replacing
> it when we have time to do so (or forking it as a slimmer version).
>
> As a tactical solution, my preference would still be to port liblognorm to
> work with native json-c objects. I think that would also clean up larger
> parts of the code.
>
>> 2. Liblognorm is unable to match the last part of a string in some cases.
>> There is no field type which could fit anything till the end of string.
>> This quirk may arise from some ideology, but it makes it impossible, for
>> example, to parse the common CSV format, unless the last field fits some
>> of the predefined field types by accident. Currently, parsers are defined
>> in libee, and there is no interface to add one, which presents us with a
>> choice: extend libee or use our own parsers. There can be other useful
>> field types, as well.
>>
> This came up on the list before. I thought there was some "rest of line"
> type of syntax, but I had no time to check that. Looks like there isn't. I
> think it would be a useful thing to have, even though this may lead to some
> problems during the parser run.
>
>> For the latter, what is the reason behind these two restrictions in the
>> char-to parser:
>>
>> It is considered a format error if
>> * a) the to-be-parsed buffer is already positioned on the terminator
>> character
>>
> don't remember exactly, but it for sure has to do with avoiding false
> positives
>
>> * b) there is no terminator until the end of the buffer
>>
> same problem. The broader the simple parsers are, the higher the chances
> for false positives or much more backtracking (in the end-of-line case it's
> just false positive). The core idea is to use (lots of) very special
> parsers, and resort to generic ones only if there is no way around that.
>
>> Both break CSV parsing.
>>
> Isn't there a CSV parser already?
>
>
> HTH
> Rainer
>
>> --
>> Pavel Levshin
>>
>> _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com/professional-services/
>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>> DON'T LIKE THAT.

From rgerhards at hq.adiscon.com Thu Oct 31 22:02:02 2013
From: rgerhards at hq.adiscon.com (Rainer Gerhards)
Date: Thu, 31 Oct 2013 22:02:02 +0100
Subject: [Lognorm] [rsyslog] liblognorm
In-Reply-To: <52716335.9090505@levshin.spb.ru>
References: <526E77CC.6040904@levshin.spb.ru> <52716335.9090505@levshin.spb.ru>
Message-ID:

That's great news! Please bear with me for a short time; I am right in the
middle of that big rule engine refactoring. I am very interested in merging
this as soon as I have sufficient time to do it decently.

Great work!
Rainer

Sent from phone, thus brief.
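
To make the char-to discussion above concrete, here is a minimal sketch in plain C of the two behaviours under debate. It is not the liblognorm or libee code: the function names, the result codes and the allow_empty flag are all hypothetical and exist only for this illustration. With allow_empty set to 0 the function enforces both restrictions quoted in the thread; with allow_empty set to 1 it relaxes restriction a) only, which is the change suggested for the case where the terminator literal follows the field. parse_rest is an equally hypothetical "rest of line" field.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes, not taken from liblognorm or libee. */
enum parse_result { PARSE_OK = 0, PARSE_NO_MATCH = -1 };

/*
 * Sketch of a char-to style field: the field runs from buf[offs] up to,
 * but not including, the first occurrence of `term`.
 *
 * Restriction b) always applies: no terminator before the end of the
 * buffer means no match.
 * Restriction a) applies only when allow_empty is 0: a buffer already
 * positioned on the terminator is rejected. With allow_empty set to 1
 * the field matches with length 0 (the relaxation proposed in the thread).
 */
int parse_char_to(const char *buf, size_t len, size_t offs,
                  char term, int allow_empty, size_t *field_len)
{
    assert(buf != NULL && field_len != NULL && offs <= len);

    const char *start = buf + offs;
    const char *hit = memchr(start, term, len - offs);

    if (hit == NULL)
        return PARSE_NO_MATCH;          /* restriction b) */
    if (hit == start && !allow_empty)
        return PARSE_NO_MATCH;          /* restriction a) */

    *field_len = (size_t)(hit - start);
    return PARSE_OK;
}

/*
 * Sketch of a "rest of line" field: it accepts everything that is left,
 * including an empty remainder, so it can never fail.
 */
int parse_rest(const char *buf, size_t len, size_t offs, size_t *field_len)
{
    assert(buf != NULL && field_len != NULL && offs <= len);

    *field_len = len - offs;
    return PARSE_OK;
}
```

Applied to the CSV-like input "a,,c", the strict variant matches the first field but rejects the empty second one, while the relaxed variant accepts it with length 0. The last field "c" has no terminator, so neither variant of char-to can match it. That is the gap a "rest of line" field would close, at the price of the broader matching that the reply warns about.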