

From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2007-06-20 20:56:41


Andrey Semashev wrote:

> UTF-8 is a variable character length encoding which complicates
> processing considerably.

It's trivial compared to the real Unicode work.

> I'd rather stick to UTF-16 if I had to use
> Unicode.

UTF-16 is a variable-length encoding too.

But anyway, Unicode itself is a variable-length format, even with the
UTF-32 encoding, simply because of grapheme clusters.
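To make the point above concrete, here is a minimal sketch that counts the code units each encoding needs for a single code point. The helper names (utf8_units, utf16_units, utf32_units, check_code_unit_counts) are made up for illustration and are not part of any library. The last check shows that even in UTF-32, one user-perceived character can span several code points.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers: code units needed to encode one Unicode scalar
// value (surrogate code points are assumed to be excluded by the caller).
constexpr std::size_t utf8_units(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}
constexpr std::size_t utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;  // above the BMP: a surrogate pair
}
constexpr std::size_t utf32_units(char32_t) {
    return 1;  // always one code unit per code point
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT: one grapheme cluster,
// but two code points, hence two UTF-32 code units.
const char32_t e_acute_decomposed[] = {0x0065, 0x0301};

void check_code_unit_counts() {
    // U+0041 LATIN CAPITAL LETTER A
    assert(utf8_units(0x41) == 1 && utf16_units(0x41) == 1);
    // U+20AC EURO SIGN
    assert(utf8_units(0x20AC) == 3 && utf16_units(0x20AC) == 1);
    // U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
    assert(utf8_units(0x1D11E) == 4 && utf16_units(0x1D11E) == 2);
    assert(utf32_units(0x1D11E) == 1);
    // A single grapheme cluster that is still two UTF-32 code units.
    assert(sizeof(e_acute_decomposed) / sizeof(e_acute_decomposed[0]) == 2);
}
```

So fixed-width code points do not give you fixed-width characters; any code that indexes "characters" has to deal with variable length at some level.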

> I'm not saying that we don't need Unicode support. We do!
> I'm only saying that in many cases plain ASCII does its job perfectly
> well: logging, system messages, simple text formatting, texts in
> restricted character sets, like numbers, phone numbers, identifiers of
> all kinds, etc.

Identifiers of all kinds aren't text, they're just bytes.
As for logging, I'm not too sure whether it should be localized or not.
And I don't understand what you mean by system messages.

I still don't understand why you want to work with other character sets.
That will just mean duplicating the tables and algorithms needed to
process the text correctly.
See http://www.unicode.org/reports/tr10/ for an idea of the complexity
of collations, which allow comparison of strings.
As you can see, it has little to do with encoding, yet the tables etc.
require the usage of the Unicode character set, preferably in a
canonical form so that it can be quite efficient.
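As a rough illustration of why a canonical form matters for comparison, the sketch below compares two UTF-32 spellings of the same character. The function toy_compose is a made-up stand-in that knows a single composition (e + U+0301 becomes U+00E9); real normalization and collation need the full Unicode tables, which is exactly the work one would not want to duplicate per character set.

```cpp
#include <cassert>
#include <string>

// Toy canonical composition, for illustration only: it handles just
// 'e' + U+0301 COMBINING ACUTE ACCENT -> U+00E9. This is NOT a real
// normalization algorithm; that requires the Unicode data tables.
std::u32string toy_compose(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        if (cp == 0x0301 && !out.empty() && out.back() == U'e') {
            out.back() = char32_t(0x00E9);  // replace "e" with the precomposed form
        } else {
            out.push_back(cp);
        }
    }
    return out;
}

// Compare after bringing both strings to the same (toy) canonical form.
bool toy_equivalent(const std::u32string& a, const std::u32string& b) {
    return toy_compose(a) == toy_compose(b);
}

void check_toy_equivalence() {
    const std::u32string precomposed{char32_t(0x00E9)};
    const std::u32string decomposed{U'e', char32_t(0x0301)};
    // Same encoding, same perceived text, yet a raw comparison fails...
    assert(precomposed != decomposed);
    // ...while comparing in a canonical form succeeds.
    assert(toy_equivalent(precomposed, decomposed));
}
```

Both inputs are already in the same encoding, so the mismatch comes from the character-level representation rather than from the choice of UTF-8, UTF-16 or UTF-32.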

> There are cases where i18n is not needed at all - mostly
> server-side apps with minimal UI.

Any application that processes or displays non-trivial text (meaning
something other than options) should have internationalization.

> Being forced to use Unicode internally
> in these cases means increased memory footprint and degraded performance
> due to encoding translation overhead.

What encoding translation are you talking about?


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk