Subject: Re: [boost] [rfc] Unicode GSoC project
From: Beman Dawes (bdawes_at_[hidden])
Date: 2009-05-15 13:36:42


On Fri, May 15, 2009 at 2:31 AM, Scott McMurray <me22.ca+boost_at_[hidden]> wrote:
> On Wed, May 13, 2009 at 18:35, Mathias Gaunard
> <mathias.gaunard_at_[hidden]> wrote:
>> Phil Endecott wrote:
>>> Some feedback based on that document:
>>>
>>>    UTF-16
>>>    ....
>>>    This is the recommended encoding for dealing with Unicode.
>>>
>>> Recommended by who?  It's not the encoding that I would normally
>>> recommend.
>>
>> The Unicode standard, in some technical notes:
>> http://www.unicode.org/notes/tn12/
>> It recommends the use of UTF-16 for general purpose text processing.
>>
>> It also states that UTF-8 is good for compatibility and data exchange, and
>> UTF-32 uses just too much memory and is thus quite a waste.
>>
>
> I really think UTF-8 should be the recommended one, since it forces
> people to remember that it's no longer one unit, one "character".
>
> Even in Beman Dawes's talk
> (http://www.boostcon.com/site-media/var/sphene/sphwiki/attachment/2009/05/07/filesystem.pdf)
> where slide 11 mentions UTF-32 and remembers that UTF-16 can still
> take 2 encoding units per codepoint, slide 13 says that UTF-16 is
> "desired" where "random access critical".

It is really important to recognize that there isn't a single
recommended Unicode encoding. The most appropriate encoding can only
be chosen in relation to a particular application and/or
algorithm.

UTF-8 and UTF-16 are both in heavy use because they serve somewhat
different needs.

UTF-32 isn't used as often as those other two in strings, but I've
found it very useful for passing around single codepoints.
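
As a rough illustration of passing a single codepoint around as a
UTF-32 value, here is a minimal sketch. The helper name append_utf8 is
hypothetical (it is not part of any Boost library), and it assumes a
compiler with the C++0x char32_t type. The string itself stays in
UTF-8; only the one codepoint crossing the interface is UTF-32.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, for illustration only: append one code point,
// passed as a single UTF-32 value, to a UTF-8 encoded std::string.
inline void append_utf8(std::string& out, char32_t cp)
{
    // Precondition: cp is a Unicode scalar value
    // (at most U+10FFFF and not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

The caller never has to know how many code units the codepoint takes
in the string's own encoding; that detail stays inside the helper.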

And then some needs change at runtime, so at least for strings an
adaptive encoding is needed.

> What kind of real-world use do people have for random access, anyways?
> Even UTF-32 isn't random access for the things I can think of that
> people would care about, what with combining codepoints and ligatures
> and other such things.

There are several related issues, assuming we are talking about
strings. Some operations are doable but uncommon, so the cost of doing
them should only be incurred if they are actually needed. Some
operations are unsafe without prior knowledge of the string contents,
but are perfectly safe with knowledge of the contents. Some operations
may be quite a bit cheaper in C++0x than in C++03, and so on. It is hard
to talk in the abstract; we need to see the actual algorithms first.
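
To make the "safe with knowledge of the contents" point a little more
concrete, here is one possible sketch, not a proposed interface. The
names is_ascii and offset_of_code_point are hypothetical, the input is
assumed to be well-formed UTF-8, and range-based for assumes C++0x.
Treating a codepoint index as a byte offset is wrong for UTF-8 in
general, but it is safe once the string is known to be pure ASCII, and
the linear scan is only paid for when that knowledge is missing.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if every byte is below 0x80, i.e. the
// UTF-8 string has exactly one code unit per code point.
inline bool is_ascii(const std::string& s)
{
    for (unsigned char c : s)
        if (c >= 0x80)
            return false;
    return true;
}

// Hypothetical helper: byte offset of the n-th code point (0-based) in
// a well-formed UTF-8 string, or s.size() if there is no such code point.
// known_ascii is the caller's prior knowledge about the contents.
inline std::size_t offset_of_code_point(const std::string& s,
                                        std::size_t n,
                                        bool known_ascii)
{
    if (known_ascii)                       // O(1): index equals offset
        return n < s.size() ? n : s.size();

    for (std::size_t i = 0; i < s.size(); ++i) {
        // Continuation bytes have the form 10xxxxxx; anything else
        // starts a new code point.
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            if (n == 0)
                return i;
            --n;
        }
    }
    return s.size();                       // O(length): had to scan
}
```

A string type could cache the result of is_ascii when the string is
built, so the cheap path is taken without rescanning on every access.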

--Beman

