[rsyslog] output plugin calling interface
Rainer Gerhards
rgerhards at hq.adiscon.com
Mon May 4 08:00:09 CEST 2009
Just one quick response, as I am preparing for the training I am giving...
> -----Original Message-----
> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
> bounces at lists.adiscon.com] On Behalf Of Tom Metro
> Sent: Monday, May 04, 2009 7:31 AM
> To: rsyslog-users
> Subject: Re: [rsyslog] output plugin calling interface
>
> david at lang.hm wrote:
> > Tom Metro wrote:
> >> Rainer Gerhards wrote:
> >>> So even if we put everything into a database, RELP cannot rely on
> >>> that information to decide which message already have been received
> >>> and which not.
> >> I'm confused. On one side a receiver is talking RELP, and via RELP it
> >> receives a batch of messages, potentially containing duplicates. On the
> >> other side of that receiver is its storage back-end. If the receiver
> >> chooses, it ought to be able to query that storage to see if any of the
> >> messages are duplicates, and if so, discard them. This doesn't involve
> >> RELP. (I described an in-memory cache for efficiency reasons, but the
> >> duplicate check could involve querying a database.)
> >
> > it's not the right thing to just eliminate duplicate message. you may get
> > the same message multiple times (with the same timestamp even). the only
> > way to know if you have seen _this copy_ of the message before is to have
> > a unique identifier for the message.
>
> Your point may be correct, but I'm not sure it has relevance to the
> material you quoted. The context of the above comments included Rainer
> saying, "RELP uses sequence numbers." So at least within the scope of a
> limited time window, the individual messages can be uniquely distinguished.
It's not a time window, it's a sliding window (much like TCP uses) that
reflects the flow of messages. But at this point of the discussion, there is
not much difference between the two.
The problem, I think, that surfaces in the discussion is that you do not
properly think about the different layers. While rsyslogd is a single
application, it is internally store-and-forward and as such mimics the
infrastructure syslog uses in general. So think of shuffling messages from
the input to the main queue as one complete "transaction". Shuffling from
the main to the action queue is another one. Executing the action is the next
one, all within a single process space. However, you can easily extend that
view to remote peers.
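
To make the "one transaction per hop" view concrete, here is a minimal Python sketch. It is not rsyslog's queue code; the `hop` helper, the `fail_before_commit` flag and the queue names are assumptions made up for illustration. It also assumes that a hop hands the message over first and commits afterwards, so a failure in between leads to a retry and a duplicate further downstream.

```python
from collections import deque

def hop(src, dst, fail_before_commit=False):
    """Move one message from src to dst as a single 'transaction'.

    The message is handed to dst first and only then removed from src,
    so a failure in between leaves it in both places (at-least-once).
    """
    if not src:
        return False
    msg = src[0]            # peek, do not remove yet
    dst.append(msg)         # hand over to the next stage
    if fail_before_commit:  # simulated crash before the commit
        return False
    src.popleft()           # commit: this stage forgets the message
    return True

input_buf = deque(["msg-A"])
main_queue = deque()
action_queue = deque()

# The first hop fails after the hand-over but before the commit, then is retried.
hop(input_buf, main_queue, fail_before_commit=True)
hop(input_buf, main_queue)

# The hop from main queue to action queue is an independent transaction.
while hop(main_queue, action_queue):
    pass

print(list(action_queue))  # ['msg-A', 'msg-A']
```

Each hop only knows about its own source and destination. Nothing in the first hop can observe what the later hops eventually did with the message.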
Within RELP, that is (omrelp -> network -> imrelp), we have sequence numbers.
But they are valid (and even exist) only in that context.
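
As an illustration of sequence numbers that only have meaning inside one session, here is a hedged Python sketch of a receiver-side sliding window. It does not reproduce the RELP wire protocol or imrelp's code; `SessionWindow`, `receive`, the session ids and the window size of 4 are all hypothetical. Treating numbers that have slid out of the window as "not new" is a design assumption of this sketch.

```python
class SessionWindow:
    """Tracks sequence numbers seen within one session (hypothetical helper)."""

    def __init__(self, size):
        self.size = size      # how many sequence numbers the window spans
        self.highest = 0      # highest sequence number accepted so far
        self.seen = set()     # accepted numbers still inside the window

    def accept(self, seqno):
        """Return True for a new message, False for a duplicate or stale one."""
        if seqno <= self.highest - self.size:
            return False      # slid out of the window, can no longer be judged
        if seqno in self.seen:
            return False      # duplicate within the window
        self.seen.add(seqno)
        if seqno > self.highest:
            self.highest = seqno
            # slide the window: forget numbers that fell off the low end
            self.seen = {s for s in self.seen if s > self.highest - self.size}
        return True

sessions = {}  # session id -> SessionWindow

def receive(session_id, seqno):
    window = sessions.setdefault(session_id, SessionWindow(size=4))
    return window.accept(seqno)

results = [
    receive("s1", 1),
    receive("s1", 2),
    receive("s1", 2),   # retransmission within the same session
    receive("s2", 2),   # same number, different session: a new message
]
print(results)  # [True, True, False, True]
```

The state lives entirely in the receiver and is keyed by session. Once the session is gone, the numbers carry no information about what was or was not received.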
> > this unique identifier may not be something that's appropriate to store
> > (if it wasn't generated by the original sender, you may not want to pass
> > it on to the software that would be analysing the logs)
>
> Right. So for example, there might not be much sense in persistently
> storing a time-limited sequence number. But that didn't seem to be the
> point Rainer was making with regards to using a database back-end. A key
> comment he made was, "we have no universal predicate 'is stored'." And I
> was wondering why such functionality is required in order to avoid
> duplicates.
Think about the store-and-forward system. An analogy: can a mail client
provide a reliable delivery notification? No, because it does not deliver the
message. Another entity does that. So, the ultimate destination may generate
a delivery report and it may send it back to you. But that's not part of the
original mail transaction but rather a new one. So the original mail client
does not have a predicate "is delivered".
In the same sense, rsyslog does not have a predicate "is stored". An input,
imrelp for example, does not even know if a message will eventually be
written to a database. Much less does it know how to use such a database
(assuming it exists) to obtain knowledge about what was transmitted so far and
what not. What, for example, if the potentially-duplicate message is one that
has been discarded by the rule engine? So using any outcome of an action - two
logical hops away - as state information for an input is unreliable and
IMHO as such unacceptable. So imrelp, if it intends to filter out duplicates,
must keep the state itself.
That, indeed, is what it is designed to do, but this is not yet implemented.
My overall position here is that rsyslog today is much more reliable than the
whole rest of the syslog infrastructure, so there is no point in getting a
tiny bit more reliability here where it cannot really be of help (an answer to
the "pure disk queue and high-volume systems" question may change my position,
this is why I don't intend to explain that point any further until I get
David's results).
> > you may get the same message multiple times (with the same timestamp
> > even).
>
> Is that true even with a high-res time stamp? I suppose that's relative
> to the resolution of your time stamp and your message throughput.
According to the relevant standards, it is microsecond resolution at best. But that
depends on the resolution of the time source. I do not consider a timestamp
to be necessarily unique.
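
A small arithmetic sketch of why a timestamp alone cannot serve as a unique key. The message rate and the two clock resolutions below are assumed numbers chosen for illustration, not measurements; `distinct_stamps` is a hypothetical helper.

```python
def distinct_stamps(count, interval_ns, resolution_ns):
    """Number of distinct timestamps for `count` evenly spaced messages."""
    stamps = {(i * interval_ns) // resolution_ns for i in range(count)}
    return len(stamps)

# 2000 messages, one every 500 ns (an assumed rate of 2 million messages/second).
print(distinct_stamps(2000, 500, 1_000))      # microsecond stamps -> 1000
print(distinct_stamps(2000, 500, 1_000_000))  # millisecond time source -> 1
```

Under these assumptions, even microsecond stamps give every second message the same timestamp as its predecessor, and a coarser time source collapses the whole burst into a single value.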
> To ensure a hash of a message is unique, you'd probably have to include
> a sequence number in the data being hashed, in addition to the time
> stamp. Actually, timestamp + sequence number ought to provide a
> sufficiently unique ID for any message within a "conversation." The hash
> is probably of value only for obtaining something smaller to store or
> faster to look up (on the receiving side).
I think this is an old discussion and the only real solution is a UUID. I
don't see any point in re-inventing it (but generating UUIDs takes time and
using them inside the syslog context requires standards).
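
For illustration, a minimal Python sketch of what UUID-based identification could look like. Syslog defines no standard place to carry such an identifier, so the `"uuid"` field, `make_message` and `is_new` are purely hypothetical; only `uuid.uuid4()` from the Python standard library is a real API.

```python
import uuid

def make_message(text):
    # "uuid" is a hypothetical field name; syslog defines no standard place for it.
    return {"uuid": str(uuid.uuid4()), "text": text}

seen = set()  # identifiers the receiver has already accepted

def is_new(msg):
    """Return True the first time a given identifier is seen."""
    if msg["uuid"] in seen:
        return False
    seen.add(msg["uuid"])
    return True

first = make_message("disk full")
second = make_message("disk full")   # same text, but a genuinely separate event
results = [is_new(first), is_new(second), is_new(first)]
print(results)  # [True, True, False]
```

The two messages with identical text are both kept, while the second copy of the first message is recognised as a duplicate. The identifier has to be generated by the original sender for this to work, which is where the cost and the need for standards come in.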
Hope that helps, please continue to let the thoughts flow...
Rainer
>
> -Tom
>
>