[Lognorm] Race condition in deletion of json-objects with 'event.tags'
singh.janmejay
singh.janmejay at gmail.com
Tue Dec 9 00:19:47 CET 2014
Sorry, I forgot to post that once I removed the tags from each rule,
segfaults stopped. It ran for more than 24 hours after which it was
restated deliberately for some deployment and config change etc.
I'll duplicate it and deploy new version and post what happens.
--
Regards,
Janmejay
PS: Please blame the typos in this mail on my phone's uncivilized soft
keyboard sporting it's not-so-smart-assist technology.
On Dec 9, 2014 4:20 AM, "Pavel Levshin" <pavel at levshin.spb.ru> wrote:
>
> Your analysis is most probably correct. As far as I remember, this object
> reuse was confusing to me. But there were no segfaults at the moment.
>
> Object reuse is intended to reduce memory allocations/frees, which have
> big impact on performance. On the other side, locks are expensive too. Even
> unsuitable for the case.
>
> Can you confirm that just duplication of event.tags solves your segfaults?
>
>
> --
> Pavel
>
>
> 07.12.2014 0:03, singh.janmejay:
>
> Little bit of an elaborate story.
>
> So, I was fighting a segfault for last few hours. It was happening in
> freeing-up of msg. Here is backtrace (it was the same in many core-dumps):
>
> (gdb) bt
> #0 0x0000000100000019 in ?? ()
> #1 0x00007fa23ee9d536 in lh_table_free (t=0x7fa22a28bc70) at
> linkhash.c:116
> #2 0x00007fa23ee9a6fd in json_object_object_delete (jso=0x7fa22967bf50)
> at json_object.c:273
> #3 0x00007fa23ee9d536 in lh_table_free (t=0x7fa2009ac960) at
> linkhash.c:116
> #4 0x00007fa23ee9a6fd in json_object_object_delete (jso=0x7fa22dcd8a70)
> at json_object.c:273
> #5 0x000000000041bef7 in msgDestruct (ppThis=0x7fa2317f9b48) at msg.c:928
> #6 0x0000000000441bd4 in DeleteProcessedBatch (pThis=0xa7eb90,
> pBatch=0xac35a8) at queue.c:1589
> #7 0x0000000000441c9e in DequeueConsumableElements (pThis=0xa7eb90,
> pWti=0xac3570, piRemainingQueueSize=0x7fa2317f9bd8) at queue.c:1626
> #8 0x0000000000441e6d in DequeueConsumable (pThis=0xa7eb90,
> pWti=0xac3570) at queue.c:1680
> #9 0x000000000044226c in DequeueForConsumer (pThis=0xa7eb90,
> pWti=0xac3570) at queue.c:1821
> #10 0x0000000000442502 in ConsumerReg (pThis=0xa7eb90, pWti=0xac3570) at
> queue.c:1875
> #11 0x000000000043d95a in wtiWorker (pThis=0xac3570) at wti.c:334
> #12 0x000000000043c51f in wtpWorker (arg=0xac3570) at wtp.c:389
> #13 0x00007fa23f6b6b50 in start_thread (arg=<optimized out>) at
> pthread_create.c:304
> #14 0x00007fa23e7de7bd in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> #15 0x0000000000000000 in ?? ()
>
>
> Now, turns out, the thing lh_table_free is trying to free(#1), is
> basically an object with several keys, and whenever it failed, it failed on
> the last one. How do I know that?
>
> (gdb) up 1
> #1 0x00007fa23ee9d536 in lh_table_free (t=0x7fa22a28bc70) at
> linkhash.c:116
> 116 linkhash.c: No such file or directory.
> (gdb) p *c
> $1 = {k = 0x7fa22dcd8a50, v = 0xa6bd20, next = 0x0, prev = 0x7fa22e93d490}
>
> And out of 2 rulebases(both being called for events of roughly the same
> frequency), every single segfault was for the rulebase that was using tags
> on rule.
>
> The rulebase in question had 2 rules, matching several fields, and both
> rules had tags applied.
>
> Note that ln_normalize applies event.tags only when tags are present,
> and insert of new key in json-c linked-hash-map goes in the end of the
> linked-list (check lh_table_insert).
>
> So, this key-value pair was supposed to be 'event.tags' and
> corresponding tags. The data that c was pointing to was garbled, so I
> couldn't tell what the key actually is, but I can conclude it was
> 'event.tags' based on size of objects. The garbling may be because the
> memory allocated for key was reused(note key is freed-up before value,
> check json_object_lh_entry_free). Considering it is allocated in a
> different thread (when message is parsed using mmnormalize), it is possible
> (as opposed to an object allocated in the same thread, using the same
> virtual-address range from one thread only).
>
> Now, what makes it a race condition? Not that event.tags object in
> ln_normalize is not duplicated, instead it is the same object with
> incrementing reference count. Not freeing routine does the counter check
> without locking or using atomic operation, check json_object_put.
>
> The only lock that msgDestruct has, is on msg->mut, which means
> different messages being processed at the same time can get that lock and
> concurrently try to free 'event.tags' object.
>
> With careful timing, some of them may concurrently decrement the counter
> and one of them may free the object. Now segfault can happen either because
> someone accesses it, or because someone tries to free it again etc.
>
> For me, the segfault was reproducing every few hours.
>
> The solution is simple, we need to duplicate the event.tags object
> instead of passing it on with incrementing reference count.
>
> I'll fix it, but before jumping to fix need another (few?) pairs of eyes
> to check correctness in reasoning (to point out any mistakes in analysis
> and ensure I haven't overlooked anything and misdiagnosed).
>
> Can someone please help validate/correct?
>
> --
> Regards,
> Janmejay
> http://codehunk.wordpress.com
>
>
> _______________________________________________
> Lognorm mailing listLognorm at lists.adiscon.comhttp://lists.adiscon.net/mailman/listinfo/lognorm
>
>
>
> _______________________________________________
> Lognorm mailing list
> Lognorm at lists.adiscon.com
> http://lists.adiscon.net/mailman/listinfo/lognorm
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.adiscon.net/pipermail/lognorm/attachments/20141209/a7b81476/attachment.html>
More information about the Lognorm
mailing list