From: Peter Bindels (dascandy_at_[hidden])
Date: 2006-09-17 16:05:53

Next message: Kevin Sopp: "Re: [boost] [intrusive] Intrusive containers library strikes back"
Previous message: Douglas Gregor: "Re: [boost] [MPI] Review comments"
In reply to: loufoque: "Re: [boost] Work that has been done on Unicode"
Next in thread: loufoque: "Re: [boost] Work that has been done on Unicode"
Reply: loufoque: "Re: [boost] Work that has been done on Unicode"

On 17/09/06, loufoque <mathias.gaunard_at_[hidden]> wrote:
> Peter Bindels wrote :
>
> > That's not entirely accurate. UTF-8 is Latin-centric, so that all
> > latin texts can be processed in linear time, taking longer for the
> > rest.
>
> Huh?
> Not really.
> All non ASCII characters, including latin ones, require more than one
> byte per character.

OK, I take that back about Latin. What I meant was the part of the
Latin script that is representable in 7-bit ASCII.

> > UTF-16 is common-centric, in that it works efficiently for all
> > common texts in all common scriptures, except for a few. Choosing
> > UTF-8 over UTF-16 would make the implementation (and accompanying
> > software) slow in all parts of the world that aren't solely using
> > Latin characters.
>
> I doubt the overhead is really noticeable.
> UTF-16 just makes validation and iteration a little simpler.

Indexing in UTF-32 is trivial. Indexing in UTF-16 is fairly trivial:
given where the boundary between the Basic Multilingual Plane and the
supplementary planes lies, you can treat all characters >0xFFFF
(encoded with two code units, i.e. a surrogate pair) as very irregular.
You could then keep an array of the indexes where these characters
appear in your string (adding slightly to the memory overhead), which
keeps indexing constant-time except for the occurrences of those
characters. You cannot apply this technique to UTF-8 text because
non-7-bit characters are a lot more common.
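
A minimal sketch of that side-array idea, assuming well-formed UTF-16
input. The class name IndexedUtf16 and its members are hypothetical and
not part of any Boost library. Lookup is a binary search over the
recorded positions, so it is effectively constant-time when the string
contains no surrogate pairs:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical wrapper (not a Boost type): a UTF-16 string plus a sorted
// list of the code point indexes at which surrogate pairs occur.
class IndexedUtf16 {
public:
    explicit IndexedUtf16(std::u16string units) : units_(std::move(units)) {
        std::size_t cp = 0;  // running code point index
        for (std::size_t i = 0; i < units_.size(); ++i, ++cp) {
            if (is_high(units_[i]) && i + 1 < units_.size() &&
                is_low(units_[i + 1])) {
                pairs_.push_back(cp);  // remember where the irregular one is
                ++i;                   // skip the low surrogate
            }
        }
        length_ = cp;
    }

    // Number of code points (not code units).
    std::size_t length() const { return length_; }

    // Code point at code point index k.
    char32_t at(std::size_t k) const {
        assert(k < length_);
        // Every surrogate pair before k shifts the code unit offset by one.
        auto it = std::lower_bound(pairs_.begin(), pairs_.end(), k);
        std::size_t before = static_cast<std::size_t>(it - pairs_.begin());
        std::size_t i = k + before;
        char32_t hi = units_[i];
        if (it != pairs_.end() && *it == k) {
            char32_t lo = units_[i + 1];
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
        return hi;
    }

private:
    static bool is_high(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
    static bool is_low(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

    std::u16string units_;
    std::vector<std::size_t> pairs_;  // sorted by construction
    std::size_t length_ = 0;
};
```

The extra memory is one index per supplementary-plane character, so
text that stays within the Basic Multilingual Plane pays nothing beyond
an empty vector.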

Add to that that the UTF-8 2-byte encoding only supports 11-bit
values. That means that all characters from 0x0800...0xD7FF and
0xE000...0xFFFF use a byte more than they would in UTF-16 (three bytes
instead of two). I checked this: it covers just about all of Asia, in
particular all common Japanese and Chinese characters, as well as a
number of Latin extended characters. You can see the ranges of Unicode
characters in the filenames of the links at:
http://www.unicode.org/charts/
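
As a rough way to check per-character sizes, here is a small sketch
using the standard encoding boundaries (0x80, 0x800 and 0x10000). The
function names utf8_bytes and utf16_bytes are made up for illustration,
and the input is assumed to be a valid Unicode scalar value:

```cpp
#include <cstddef>

// Bytes needed to encode one Unicode scalar value in UTF-8.
constexpr std::size_t utf8_bytes(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed to encode one Unicode scalar value in UTF-16.
constexpr std::size_t utf16_bytes(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

// U+3042 (HIRAGANA LETTER A) costs one byte more in UTF-8 than in UTF-16.
static_assert(utf8_bytes(0x3042) == utf16_bytes(0x3042) + 1,
              "3 bytes in UTF-8, 2 bytes in UTF-16");
// ASCII goes the other way: one byte in UTF-8, two in UTF-16.
static_assert(utf8_bytes(U'A') + 1 == utf16_bytes(U'A'),
              "1 byte in UTF-8, 2 bytes in UTF-16");
```

Above 0xFFFF both encodings need four bytes, so the difference only
matters inside the Basic Multilingual Plane.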

> UTF-32 allows random access but that's rather useless since you need to
> iterate over the string anyway to handle combining characters.

That's a point I hadn't thought of. In that case, what advantages does
UTF-32 hold over either of the other two?


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk