Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-19 21:55:51


On 01/19/2011 06:58 AM, Edward Diener wrote:
> ... elision by patrick...
>
> I do not believe that UTF-8 is the way to go. In fact I know it is
> not, except perhaps for the very near future for some programmers (
> Linux advocates ).
>
> Inevitably a Unicode standard will be adapted where every character of
> every language will be represented by a single fixed length number of
> bits. Nobody will care any longer that this fixed length set of bits
> "wastes space", as so many people today hysterically are fixated on.
> Whether or not UTF-32 can do this now or not I do not know but this
> world where a character in some language on earth is represented by
> some arcane multi-byte encoding will end. If UTF-32 can not do it then
> UTF-nn inevitably will.
UTF-32 is the only UCS fixed width encoding.

UTF-16 can encode the whole Basic Multilingual Plane in fixed width, one
16-bit code unit per character. That's most of the characters in use in
the world. If you know your problem domain, and know that you are in the
first code plane, then you can use UTF-16 as a fixed width encoding. If
you know that you have to be able to handle any UCS character, then you
can't. Currently 107,296 of the characters in UCS are defined out of a
total code space of 1,114,112 (0 to 10FFFF hexadecimal).
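
To make the fixed width question concrete, here is a minimal sketch of how
you might check whether a code point fits in a single UTF-16 code unit.
The names (kMaxCodePoint, in_bmp, is_surrogate, utf16_units) are my own
inventions for illustration and don't come from any library. It assumes
you already have decoded code points as 32-bit integers.

```cpp
#include <cstdint>

// Largest code point in the UCS code space (hypothetical constant name).
constexpr std::uint32_t kMaxCodePoint = 0x10FFFF;

// Code points run from 0 through 0x10FFFF inclusive, so 1,114,112 in total.
constexpr std::uint32_t kCodeSpaceSize = kMaxCodePoint + 1;

// True if the code point lies in the Basic Multilingual Plane (plane 0).
constexpr bool in_bmp(std::uint32_t cp) {
    return cp <= 0xFFFF;
}

// True for D800..DFFF, the range reserved for UTF-16 surrogate pairs.
// These values are not characters in their own right.
constexpr bool is_surrogate(std::uint32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}

// Number of 16-bit code units UTF-16 needs for one code point:
// 1 inside the BMP, 2 (a surrogate pair) outside it,
// 0 if the value is not an encodable code point at all.
constexpr int utf16_units(std::uint32_t cp) {
    return (cp > kMaxCodePoint || is_surrogate(cp)) ? 0
         : in_bmp(cp)                               ? 1
                                                    : 2;
}
```

If utf16_units() returns 1 for every code point your application will
ever see, you can index a UTF-16 buffer by character. The first time it
returns 2, that assumption is gone.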

>
>
> I do not think that shoving UTF-8 down everybody's throats is the best
> solution even now, I think a good set of classes to convert between
> encoding standards is much better.
I agree with you. Nobody should shove any one solution down anyone's
throat. Instead, I wish that more people would understand the
trade-offs of different encodings and when each might be more desirable
instead of saying, "Oh, we can never do that." or "Oh, we must always
do that." The best thing is to understand your problem domain, and what
the implications of that domain are in each of the possible encodings.

The truth is that the web and XML apps all use Unicode, as do more and
more applications. Nobody considers doing new international
applications with anything other than Unicode. That means that you need
to know about the three encodings, UTF-8, UTF-16 and UTF-32, and their
trade-offs. If you're on a fast, lightly loaded machine with lots of
memory, there could be real advantages to UTF-32. If you're running on
a hand-held device with limited memory, UTF-8 could be a real winner.
That's a simplistic view of a complex decision, but if you're doing the
design for something you should educate yourself and make the complex
decision with forethought.
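
As a rough illustration of the memory side of that trade-off, here is a
sketch that counts how many bytes a run of code points would occupy in
each of the three encodings. The function and struct names are
hypothetical, made up for this example. It assumes the input is already
valid (no surrogates, nothing above 10FFFF hexadecimal) and it ignores
byte order marks and terminators.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed to encode one valid code point in UTF-8 (1 to 4).
constexpr std::size_t utf8_bytes(std::uint32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed in UTF-16: one 16-bit unit in the BMP, two units outside it.
constexpr std::size_t utf16_bytes(std::uint32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

// UTF-32 always uses one 32-bit unit.
constexpr std::size_t utf32_bytes(std::uint32_t) {
    return 4;
}

// Totals for a whole text (hypothetical helper type).
struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Sum the per-code-point sizes over n code points.
inline EncodedSizes encoded_sizes(const std::uint32_t* cps, std::size_t n) {
    EncodedSizes total = {0, 0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        total.utf8 += utf8_bytes(cps[i]);
        total.utf16 += utf16_bytes(cps[i]);
        total.utf32 += utf32_bytes(cps[i]);
    }
    return total;
}
```

For plain ASCII text UTF-8 comes out at a quarter of the size of UTF-32.
For text made up mostly of BMP characters at U+0800 and above, UTF-8
needs three bytes per character against two for UTF-16. Which encoding
wins depends on what your data actually looks like, which is the point.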

You can get your own copy of the Unicode 5.2 standard as a zipped PDF
file at http://www.unicode.org/versions/Unicode5.2.0/UnicodeStandard-5.2.zip

The 6.0 standard is being worked on as we speak.

Patrick

