[rsyslog] abort cases
Rainer Gerhards
rgerhards at hq.adiscon.com
Thu Sep 18 09:23:11 CEST 2008
On Wed, 2008-09-17 at 19:36 +0200, Lorenzo M. Catucci wrote:
> >> is still running, while it usually didn't last more than 12 hours, it
> >> seems the timing variation does somewhat cure the symptoms.
> >
> > This is what I am thinking, too. I am running it under helgrind. If you
> > do that, you'll see a couple of warnings, most have been found to be
> > harmless by previous analysis. Anyhow, I am re-doing the analysis now
> > (which takes quite a bit of time). I am just telling you so that you
> > don't wonder when you see so many warnings. I think there is even a
> > discussion thread somewhere telling why this is. Mostly it is cases
> > where we do not really need sync... or so I thought ;)
> >
>
> Just to help me understand, do you think val-grinding this time is
> near-to-useless, and I'd better restart the daemon under helgrind, or you
> prefer I continue valgrinding while you helgrind?
From a pragmatic POV: if running it under valgrind helps ease the
immediate problem, please do that. I suspect valgrind will not report
anything, but if it does, that would probably be very interesting.
Running valgrind's helgrind tool yourself would not make sense, because
it emits lots of messages, which need to be interpreted by someone with
deep knowledge of the code. This is what I am looking at.
> I feared uttering the words "race condition" would have been doing just
> like the patient telling the doctor what he does want to hear as a
> diagnosis... now that I see you are looking for missing syncs I think we
> share this gut feeling...
Yes, definitely. There is also one not directly technical reason that
makes me believe this: we have had almost no serious issues with rsyslog
since the highly parallel multi-threading engine was introduced. I was a
bit astonished, because getting such a complex beast absolutely right
the first time - even with all the testing it received in my
environment - is something I very seldom hear about. Now, all of a
sudden, multiple bug reports come in, all pointing in the same
direction. I now conclude that many folks were hesitant to actually use
the new version and, now that time has passed, deployments into
high-demand environments are beginning. Unfortunately, I do not have
funding for the high-end machines and the amount of time required to do
all this high-end testing after the release was finished (and even if I
had, I'd probably still not have seen one issue or the other). So it
sounds somewhat logical to me that we have now begun the actual fire
drill for the new highly parallel processing part. And that, of course,
points to issues with thread execution order, aka race conditions ;)
This direction would also explain why the issues did not come up
earlier (for the reasons given above).
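
To make "missing sync" concrete, here is a minimal sketch in C with
POSIX threads. It is not taken from the rsyslog source: the struct, the
function names (worker, run_workers) and the constants are made up for
illustration only. Several threads increment a shared counter, either
with or without a mutex around the read-modify-write.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N_THREADS 4
#define N_ITERS   100000

/* Hypothetical shared state; this does not mirror rsyslog's real data. */
struct shared {
    long counter;
    int use_lock;            /* 1 = protect the counter with the mutex */
    pthread_mutex_t mut;
};

static void *worker(void *arg)
{
    struct shared *s = arg;
    for (int i = 0; i < N_ITERS; i++) {
        if (s->use_lock)
            pthread_mutex_lock(&s->mut);
        s->counter++;        /* read-modify-write, not atomic by itself */
        if (s->use_lock)
            pthread_mutex_unlock(&s->mut);
    }
    return NULL;
}

/* Runs N_THREADS workers and returns the final counter value. */
long run_workers(int use_lock)
{
    struct shared s;
    pthread_t tid[N_THREADS];

    s.counter = 0;
    s.use_lock = use_lock;
    pthread_mutex_init(&s.mut, NULL);

    for (int i = 0; i < N_THREADS; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, &s);
        assert(rc == 0);     /* sketch only: real code would handle this */
        (void)rc;
    }
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(tid[i], NULL);

    pthread_mutex_destroy(&s.mut);
    return s.counter;
}
```

Called from a small main(), run_workers(1) should return
N_THREADS * N_ITERS. run_workers(0) is a data race, so its result is
unpredictable and may come out lower, depending on timing and on how
many cores actually run the threads in parallel. Built with
"cc -pthread" and run under "valgrind --tool=helgrind", the unlocked
variant is the kind of access helgrind reports as a possible data race.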
> Please let me know if I can help more, since
> I'm somewhat in the hope this dreaded shared memory 8 way system could
> very well shake the races... As a matter of fact, the destination server
> is a twin-brother of the source one, and is running (and logging to
> postgresql) without any hiccup since started-up.
All you can do at this time is be patient. I am reviewing the code.
The worker thread pools and all this logic are highly complex. I must
make sure that I again have a very tight grip on it, review all the
subtle cases I came across and then look at the helgrind output and
other diagnostic sources to see where the issue is. At that point it
would probably be useful to have a gdb backtrace of an aborted process
(the more, the better), but my experience with these kinds of problems
is that good analysis is more likely to solve them than any captured
real-time data.
As soon as I have a question, I'll post. Should you notice something
that you find interesting, please post. Ignoring something is easy, not
knowing something that may potentially help would be bad ;)
Rainer