Subject: Re: [boost] RFC: interest in Unicode codecs?
From: Graham (Graham_at_[hidden])
Date: 2009-02-14 18:49:48


Esben,

>I think you have gotten something mixed up. UTF-8 and UTF-32 (aka UCS4) are
>just two encodings of the same character set, including the combining you
>mentioned (which are really not that uncommon, e.g. m?l?e contains 2
>characters which could be written by combining glyphs). In practical terms,
>UTF-32 is somewhat useless. (A case might be made for UTF-16, though)
>Kind regards, Esben

Having written both basic text editors and Unicode text editors, I can
say that if you are only targeting Western Hemisphere languages then it
may be more efficient to go with UTF-8. If you stick to Unicode Plane 0
(the Basic Multilingual Plane) then UTF-16 might be appropriate if you
have no formatting bits, but by the time you want to do a full Unicode
text editor you end up using 21 bits of each 32-bit UTF-32 unit for the
code point [the highest code point is U+10FFFF], which leaves the
remaining 11 bits for your own formatting info if you need it
[font/colour etc]. Compared with using surrogates, you are [very]
slightly wasteful in a 32-bit width, but this is a very acceptable
trade-off for simplicity. In that sense UTF-32 is a misnomer, as the
code point does not occupy a full 32 bits, but it is still an encoding!
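
To make the bit-packing idea concrete, here is a minimal sketch of one
possible cell layout for an editor buffer. It assumes the low 21 bits
hold the code point and the high 11 bits hold application-defined
formatting flags. The names pack_cell, cell_code_point and cell_format
are hypothetical, chosen for illustration only, and a cell packed this
way is an internal representation, not conformant UTF-32 for interchange.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout (illustrative only, not from any library):
//   bits  0..20  Unicode code point (0 .. 0x10FFFF)
//   bits 21..31  application-defined formatting flags (11 bits)
constexpr std::uint32_t kCodePointBits = 21;
constexpr std::uint32_t kCodePointMask = (1u << kCodePointBits) - 1;  // 0x1FFFFF
constexpr std::uint32_t kFormatMask    = (1u << (32 - kCodePointBits)) - 1;  // 0x7FF
constexpr std::uint32_t kMaxCodePoint  = 0x10FFFF;

// Combine a code point and formatting flags into one 32-bit cell.
inline std::uint32_t pack_cell(std::uint32_t code_point, std::uint32_t format) {
    assert(code_point <= kMaxCodePoint);  // must be a valid Unicode code point
    assert(format <= kFormatMask);        // must fit in the spare 11 bits
    return (format << kCodePointBits) | code_point;
}

// Recover the code point from a packed cell.
inline std::uint32_t cell_code_point(std::uint32_t cell) {
    return cell & kCodePointMask;
}

// Recover the formatting flags from a packed cell.
inline std::uint32_t cell_format(std::uint32_t cell) {
    return cell >> kCodePointBits;
}
```

For example, pack_cell(0x41, 1) stores 'A' with formatting flag 1 as
0x200041. Every cell is the same width, so indexing by character
position is a plain array lookup, which is the simplicity being traded
against the extra memory.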

Yours,

Graham


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk