From: Peter Bindels (dascandy_at_[hidden])
Date: 2006-09-17 16:05:53


On 17/09/06, loufoque <mathias.gaunard_at_[hidden]> wrote:
> Peter Bindels wrote :
>
> > That's not entirely accurate. UTF-8 is Latin-centric, so that all
> > latin texts can be processed in linear time, taking longer for the
> > rest.
>
> Huh?
> Not really.
> All non ASCII characters, including latin ones, require more than one
> byte per character.

OK, I take that back about Latin in general. I meant to say the part of
the Latin script that is representable in 7-bit ASCII.

> > UTF-16 is common-centric, in that it works efficiently for all
> > common texts in all common scripts, except for a few. Choosing
> > UTF-8 over UTF-16 would make the implementation (and accompanying
> > software) slow in all parts of the world that aren't solely using
> > Latin characters.
>
> I doubt the overhead is really noticeable.
> UTF-16 just makes validation and iteration a little simpler.

Indexing in UTF-32 is trivial. Indexing in UTF-16 is nearly as easy:
the boundary between the Basic Multilingual Plane and the higher planes
was drawn so that characters above 0xFFFF (encoded as a surrogate pair,
i.e. two 16-bit code units) are very rare. You could then keep an array
of the positions where these characters appear in your string (adding
slightly to the overhead), which keeps indexing constant-time for any
string without such characters and close to it otherwise. You cannot
apply this technique to UTF-8 text because non-7-bit characters are far
more common there.
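
A rough sketch of that side-array idea, under a few assumptions: the
class name IndexedUtf16 is hypothetical (it is not part of Boost or the
standard library), the string is immutable after construction, and an
unpaired surrogate is counted as one code point. Lookup does a binary
search over the recorded positions, so it costs O(log k) for k
surrogate pairs and is effectively constant when there are none.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper for illustration; not part of Boost or the standard library.
class IndexedUtf16 {
public:
    explicit IndexedUtf16(std::u16string units)
        : units_(std::move(units)), length_(0) {
        // One pass: remember the code-point index of every surrogate pair.
        std::size_t cp = 0;
        for (std::size_t i = 0; i < units_.size(); ++i, ++cp) {
            if (is_pair_at(i)) {
                pair_positions_.push_back(cp);
                ++i;  // skip the low surrogate
            }
        }
        length_ = cp;
    }

    // Number of code points (an unpaired surrogate counts as one).
    std::size_t length() const { return length_; }

    // Code point at code-point index cp.
    char32_t at(std::size_t cp) const {
        assert(cp < length_);
        // Every pair before cp shifts the code-unit offset by one.
        const std::size_t pairs_before = static_cast<std::size_t>(
            std::lower_bound(pair_positions_.begin(), pair_positions_.end(), cp) -
            pair_positions_.begin());
        const std::size_t u = cp + pairs_before;
        if (is_pair_at(u)) {
            return 0x10000 +
                   ((static_cast<char32_t>(units_[u]) - 0xD800) << 10) +
                   (static_cast<char32_t>(units_[u + 1]) - 0xDC00);
        }
        return units_[u];
    }

private:
    // True if a high surrogate at unit i is followed by a low surrogate.
    bool is_pair_at(std::size_t i) const {
        return i + 1 < units_.size() &&
               units_[i] >= 0xD800 && units_[i] <= 0xDBFF &&
               units_[i + 1] >= 0xDC00 && units_[i + 1] <= 0xDFFF;
    }

    std::u16string units_;
    std::vector<std::size_t> pair_positions_;  // sorted code-point indexes of pairs
    std::size_t length_;
};
```

For a string such as u"a\U0001D11Eb", at(1) decodes the surrogate pair
and at(2) lands on 'b' without rescanning from the start. The same
trick would need one array entry per non-ASCII character in UTF-8,
which is why it does not pay off there.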

Add to that that the UTF-8 2-byte encoding only carries 11 bits of
payload, so it stops at 0x07FF. That means that all characters in
0x0800...0xD7FF and 0xE000...0xFFFF take three bytes in UTF-8, one more
than they would in UTF-16. I checked this: it covers just about all of
Asia, in particular all common Japanese and Chinese characters, as well
as a number of Latin extended characters. You can see the ranges of
Unicode characters in the filenames of the links at:
http://www.unicode.org/charts/
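
To make the size comparison concrete, here is a small sketch of the
per-character cost in each encoding. The function names utf8_bytes and
utf16_bytes are made up for this example, and the input is assumed to
be a valid Unicode scalar value (no surrogates, nothing above
0x10FFFF).

```cpp
#include <cstddef>

// Bytes needed to encode one code point in UTF-8.
// Hypothetical helper; assumes cp is a valid Unicode scalar value.
constexpr std::size_t utf8_bytes(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed to encode one code point in UTF-16.
constexpr std::size_t utf16_bytes(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

// The 2-byte UTF-8 form ends at 0x07FF; from 0x0800 on it takes 3 bytes.
static_assert(utf8_bytes(0x07FF) == 2, "last 2-byte UTF-8 code point");
static_assert(utf8_bytes(0x0800) == 3, "first 3-byte UTF-8 code point");
// A common CJK ideograph: 3 bytes in UTF-8, 2 bytes in UTF-16.
static_assert(utf8_bytes(0x4E2D) == 3 && utf16_bytes(0x4E2D) == 2, "CJK");
```

Note the flip side: for 0x0000...0x007F UTF-8 needs one byte where
UTF-16 needs two, and above 0xFFFF both need four.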

> UTF-32 allows random access but that's rather useless since you need to
> iterate over the string anyway to handle combining characters.

That's a point I hadn't thought of. In that case, what advantages does
UTF-32 hold over either of the other two?
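
To illustrate the combining-character point, a small sketch. It makes
a loud simplification: only the block 0x0300...0x036F (Combining
Diacritical Marks) is treated as combining, whereas real text
segmentation needs the full Unicode rules. The function names are
invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F count as combining marks.
inline bool is_combining_simplified(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Code-point index where the n-th user-perceived character starts,
// or s.size() if the string has fewer than n + 1 such characters.
inline std::size_t nth_character_start(const std::u32string& s, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // A combining mark attaches to the preceding base character.
        if (i > 0 && is_combining_simplified(s[i])) continue;
        if (seen == n) return i;
        ++seen;
    }
    return s.size();
}

// Usage sketch: 'e' + U+0301 (combining acute accent) + 'a'.
inline void demo() {
    const std::u32string s = U"e\u0301a";
    assert(s[1] == 0x0301);                  // plain indexing lands on the accent
    assert(nth_character_start(s, 1) == 2);  // the second character is 'a'
}
```

So even with UTF-32, finding the n-th character as a user sees it
still takes a scan; constant-time indexing only gets you the n-th code
point.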

