Subject: Re: [boost] [rfc] Unicode GSoC project
From: Scott McMurray (me22.ca+boost_at_[hidden])
Date: 2009-05-15 02:31:47


On Wed, May 13, 2009 at 18:35, Mathias Gaunard
<mathias.gaunard_at_[hidden]> wrote:
> Phil Endecott wrote:
>> Some feedback based on that document:
>>
>> UTF-16
>> ....
>> This is the recommended encoding for dealing with Unicode.
>>
>> Recommended by who? It's not the encoding that I would normally
>> recommend.
>
> The Unicode standard, in some technical notes:
> http://www.unicode.org/notes/tn12/
> It recommends the use of UTF-16 for general purpose text processing.
>
> It also states that UTF-8 is good for compatibility and data exchange, and
> UTF-32 uses just too much memory and is thus quite a waste.
>

I really think UTF-8 should be the recommended one, since it forces
people to remember that it's no longer one unit, one "character".
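
To make the "one unit is not one character" point concrete, here is a minimal sketch in Python (chosen only for brevity; the helper name code_units and the sample characters are illustrative assumptions, not anything from the GSoC document). It counts how many code units a string needs in each encoding form:

```python
def code_units(text: str) -> dict:
    """Count code points and code units of `text` in each encoding form."""
    return {
        "code_points": len(text),                            # Python 3 str is indexed by code point
        "utf8_units": len(text.encode("utf-8")),             # 1 byte per code unit
        "utf16_units": len(text.encode("utf-16-le")) // 2,   # 2 bytes per code unit
        "utf32_units": len(text.encode("utf-32-le")) // 4,   # 4 bytes per code unit
    }


# Hypothetical samples: ASCII, Latin-1 range, BMP, and a supplementary-plane character.
for sample in ["a", "\u00e9", "\u20ac", "\U0001D11E"]:
    print(f"U+{ord(sample):04X}", code_units(sample))
```

U+1D11E (MUSICAL SYMBOL G CLEF) needs four UTF-8 units and two UTF-16 units, so UTF-16 code still has to handle the multi-unit case; it just hits it rarely enough that it is easy to forget.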

Even in Beman Dawes's talk
(http://www.boostcon.com/site-media/var/sphene/sphwiki/attachment/2009/05/07/filesystem.pdf)
where slide 11 mentions UTF-32 and notes that UTF-16 can still
take 2 code units per codepoint, slide 13 says that UTF-16 is
"desired" where "random access critical".

What kind of real-world use do people have for random access, anyways?
Even UTF-32 isn't random access for the things I can think of that
people would care about, what with combining codepoints and ligatures
and other such things.
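
A rough illustration of the combining-codepoint problem, again as a Python sketch under stated assumptions: split_clusters is a hypothetical helper that only glues combining marks onto the preceding base character. It is deliberately much simpler than the real grapheme cluster rules in UAX #29, and only shows that indexing by code point does not index by user-perceived character.

```python
import unicodedata


def split_clusters(text: str) -> list:
    """Crude grouping: attach each combining mark to the preceding character.

    This is NOT the UAX #29 grapheme cluster algorithm, only an approximation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # combining mark: extend the previous cluster
        else:
            clusters.append(ch)     # base character: start a new cluster
    return clusters


# "eleve" with a combining acute and a combining grave, in decomposed form.
decomposed = "e\u0301le\u0300ve"

print(len(decomposed))                   # number of code points
print(len(split_clusters(decomposed)))   # number of approximate "characters"
print(f"U+{ord(decomposed[1]):04X}")     # what index 1 actually points at
```

Here the string has 7 code points but 5 clusters, and index 1 lands on U+0301, a combining accent, even if every code point is stored as one fixed-width UTF-32 unit.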

As an aside, I'd like to see comparisons between compressed UTF-8 and
compressed UTF-16, since neither one is random-access anyways, and it
seems to me that caring about size of text before compression is about
as important as the performance of a program with the optimizer turned
off.
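
One way such a comparison could be run, as a sketch only: the function name compressed_sizes, the choice of zlib at level 9, and the placeholder sample text are all assumptions, and no result is claimed here. A meaningful measurement would need a real corpus in several scripts, since repeated filler text compresses unrealistically well.

```python
import zlib


def compressed_sizes(text: str, level: int = 9) -> dict:
    """Return {encoding: (raw_bytes, compressed_bytes)} for UTF-8 and UTF-16."""
    sizes = {}
    for name in ("utf-8", "utf-16-le"):
        raw = text.encode(name)
        packed = zlib.compress(raw, level)
        sizes[name] = (len(raw), len(packed))
    return sizes


# Placeholder input; substitute real documents to get numbers worth discussing.
sample = "Replace this with a real corpus. " * 200

for name, (raw, packed) in compressed_sizes(sample).items():
    print(f"{name}: {raw} bytes raw, {packed} bytes compressed")
```

The interesting figure is the ratio between the two compressed sizes, compared against the ratio between the two raw sizes.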

