
From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2008-02-24 08:12:57


Frank Mori Hess wrote:
> I don't have a lot of experience using non-ascii strings in my internal
> code, aside from occasional forays into utf-8 for special characters,
> but wouldn't using ucs-4 for the "core" encoding be the sane thing to
> do? With a ucs-4 encoding, you could use a
>
> basic_string<wchar_t>
>
> and continue using the familiar api without worrying about the
> complications and confusion caused by variable length encodings.
The sane thing, perhaps. But take a look at Mozilla, for example, who're
dealing with character data a lot. Currently they're evaluating the
memory and speed effects of switching from UTF-16 to UTF-8 for
everything. The reasoning is that even on web pages that consist mostly
of exotic characters, there's still a lot of ASCII around (not counting
tag names): URIs, IDs, classes, names, etc. Thus, the space savings
could be considerable. (If I remember correctly, current benchmarks show an
average saving of a few percent, though on an unfortunately unrepresentative
set of pages.)
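To make the space argument concrete, a rough sketch (the markup fragment and
URI are made up, and C++11 string types are used only to get the three unit
widths):

    #include <iostream>
    #include <string>

    int main() {
        // Hypothetical markup fragment: tag, attribute and URI are ASCII;
        // only the visible text ("über") needs more than one byte in UTF-8.
        std::string    utf8  = "<a href=\"http://example.org/\">\xc3\xbc" "ber</a>";
        std::u16string utf16 = u"<a href=\"http://example.org/\">\u00fcber</a>";
        std::u32string utf32 = U"<a href=\"http://example.org/\">\u00fcber</a>";

        // size() counts code units; multiply by the unit width to get bytes.
        std::cout << "UTF-8:  " << utf8.size()  * sizeof(char)     << " bytes\n"
                  << "UTF-16: " << utf16.size() * sizeof(char16_t) << " bytes\n"
                  << "UTF-32: " << utf32.size() * sizeof(char32_t) << " bytes\n";
    }

The ASCII markup costs one byte per character in UTF-8 but two or four in the
wider encodings, which is where the savings come from.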

Can you imagine what these developers would think of switching to
UTF-32, where 11 bits of every code unit are guaranteed to be wasted,
simply because all Unicode 5 planes can be represented in 21 bits?
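
The arithmetic: the largest Unicode scalar value is U+10FFFF, which fits in
21 bits, so at least 11 bits of every 32-bit code unit are always zero. A
tiny sketch:

    #include <cstdint>
    #include <iostream>

    int main() {
        // U+10FFFF is the highest scalar value Unicode will ever assign.
        std::uint32_t max_code_point = 0x10FFFF;
        int bits = 0;
        for (std::uint32_t v = max_code_point; v != 0; v >>= 1)
            ++bits;
        std::cout << bits << " bits needed, "
                  << 32 - bits << " bits of every UTF-32 unit always zero\n";
        // Prints: 21 bits needed, 11 bits of every UTF-32 unit always zero
    }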

Sebastian

