[rsyslog] Unicode & rsyslog - was: RE: PostgreSQL: Problems with character encoding
david at lang.hm
Mon Jan 25 09:42:32 CET 2010
On Mon, 25 Jan 2010, Rainer Gerhards wrote:
> David,
>
> we need to make a distinction between UTF, a transformation (and transfer)
> format and UCS, the actual native encoding format here. I think you mix these
> two things up. Unicode has two (primary) flavors, which are usually encoded
> in UCS-16 and UCS-32 (or was it named UCS-2 and UCS-4 - guess so), being 2 and
> 4 bytes respectively. UCS-16 is what is implemented for example in Windows.
> It covers many of this world's scripts, but has proven to not cover all, which
> caused additional code tables and UCS-32 presentation (at least as far as I
> know, I am not a Unicode expert ;)).
>
> UTF-8 is an encoding of Unicode code tables. You can think of it as
> traditional multi-byte character set which means each character takes up a
> varying number of bytes. Usually, UTF representations are converted into UCS
> and then UCS is used to do the processing. While UCS requires more bytes, UTF
> requires parsing of the message *each time* it is processed (e.g. to check
> for a string match, count character sizes, obtain a substring). So using UTF
> may use up fewer bytes, but can very considerably increase the processing time
> needed and program complexity. For US-ASCII, of course, this is no problem. But
> for other encodings, the performance hit can be very severe, much more than
> the hit by double memory consumption (UCS-2 is still being considered as
> "sufficient" for almost all cases, even in the future).

Thanks for the clarification on terms. I had the basic understanding, but
not the exact terminology.

> So I don't think it would serve the non-US-ASCII world well to process the
> transformation formats. I guess that's a good option if you have a US-ASCII
> based system that only very occasionally needs to process a foreign language
> string (and even then, you need to parse the message *each* time you access
> it, specifically when obtaining substrings...).
>
> My conclusion is that rsyslog needs to do a UTF to UCS conversion on entry to
> the system and then uses UCS internally (and converts back when messages are
> output). Many software systems do so, and, as I said, IMHO do so for good
> reasons.

The question is how many different places/times we are parsing the data as
strings, versus how many places we are just moving the data around as
essentially opaque blobs.

When we receive and parse the message we have to deal with the data as
strings of characters, but this is generally done in one pass through the
input data, so it would cost about the same to process the data as-is as to
convert it to UCS-2 (let alone then processing it as UCS-2). This pass can
calculate the number of characters in the string (i.e. 'length') and store
it.
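
A minimal sketch of that counting pass, assuming the input is well-formed
UTF-8 and that "character" means a Unicode code point. The function name is
hypothetical and is not part of rsyslog:

```python
def utf8_char_count(data: bytes) -> int:
    """Count code points in well-formed UTF-8 without decoding it."""
    # Continuation bytes have the bit pattern 10xxxxxx; every other byte
    # starts a new character, so counting those bytes counts characters.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


msg = "na\u00efve caf\u00e9".encode("utf-8")
print(len(msg), utf8_char_count(msg))  # prints: 12 10
```

The count can be stored next to the byte string at parse time, so later
stages never need to rescan the data just to learn its length.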

Then these parsed chunks of data get copied around (in complex
configurations with many queues, they get copied around a LOT).

At some point (or points) comparisons are made, but in most cases these
comparisons can be done byte-by-byte; you don't actually have to parse the
data. (For regex matches you do, and for contains you would have to check
the start of the match to make sure that the first matching byte isn't the
tail end of a prior character, but I think that's it.)
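
A sketch of such a byte-level "contains", assuming well-formed UTF-8 in the
haystack. The helper names are hypothetical. Worth noting: if the needle is
itself well-formed UTF-8, the boundary check can never fail, because lead
bytes and continuation bytes occupy disjoint ranges; it only matters when
the needle may be arbitrary bytes.

```python
def is_continuation(b: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (b & 0xC0) == 0x80


def utf8_find(haystack: bytes, needle: bytes) -> int:
    """Byte offset of the first match that starts on a character
    boundary, or -1 if there is none."""
    start = 0
    while True:
        pos = haystack.find(needle, start)
        if pos < 0:
            return -1
        # Accept the match only if it does not begin in the middle of a
        # multi-byte character.
        if not needle or not is_continuation(haystack[pos]):
            return pos
        start = pos + 1


line = "caf\u00e9 open".encode("utf-8")
print(utf8_find(line, "\u00e9".encode("utf-8")))  # prints: 3
print(utf8_find(line, b"\xa9"))                   # prints: -1
```

The second call shows the misaligned case: the byte 0xA9 does occur in the
data, but only as the tail of the two-byte sequence for U+00E9.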

Eventually we create the output string. At that point we are
assembling the string from the various substrings that we have stored
(which can still be treated as a series of bytes). It's only when the
property replacer is invoked with either character positions or options
that the data needs to be treated as a UTF-8 string instead of a series of
bytes again. Yes, there are a lot of things that it can do, but how much
are they used in real life (other than setting a max length, which could
be special-cased to not be checked if the number of bytes is less than
the length you are checking against)?
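
The max-length special case might look like the sketch below, again assuming
well-formed UTF-8 and a limit expressed in characters. The function name is
hypothetical:

```python
def truncate_utf8(data: bytes, max_chars: int) -> bytes:
    """Limit data to at most max_chars characters, cutting only on
    character boundaries."""
    # Fast path: every character occupies at least one byte, so a value
    # that fits in max_chars bytes cannot exceed max_chars characters.
    if len(data) <= max_chars:
        return data
    seen = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:      # this byte starts a character
            if seen == max_chars:
                return data[:i]     # cut just before the extra character
            seen += 1
    return data


print(truncate_utf8(b"hello", 10))  # prints: b'hello'
```

For pure-ASCII logs shorter than the limit, the fast path returns
immediately and the data is never scanned.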

Remember that this is not general-purpose input and output that we are
dealing with; it's logs. And like it or not, most logs really are in
ASCII, simply because for so many years there was no option.

Also consider that the input and output stages can be split into multiple
worker threads, while the queue manipulation (and copying) is done inside
locks.

It may be best to leave the data as UTF-8 unless the property replacer has
been given options, and then let the property replacer convert the data,
work on it, and convert it back (if there is more than one option being
invoked).
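
As a rough sketch of that idea: the bytes pass through untouched unless an
option is given, and are otherwise decoded once, processed, and encoded
once. The function and its options (start, end, upper) are hypothetical
stand-ins and not rsyslog's actual property replacer interface; Python's str
type stands in for the fixed-width internal representation.

```python
from typing import Optional


def apply_property(raw: bytes,
                   start: Optional[int] = None,
                   end: Optional[int] = None,
                   upper: bool = False) -> bytes:
    """Return the property value, converting only when options need it."""
    if start is None and end is None and not upper:
        return raw                  # no options: bytes stay opaque
    text = raw.decode("utf-8")      # convert once on the way in
    text = text[start:end]          # character positions, not byte offsets
    if upper:
        text = text.upper()
    return text.encode("utf-8")     # convert once on the way out


value = "caf\u00e9 bar".encode("utf-8")
print(apply_property(value) is value)  # prints: True
```

With several options in play, the decode and encode costs are paid once per
property rather than once per option.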
David Lang