From: loufoque (mathias.gaunard_at_[hidden])
Date: 2006-09-17 14:20:15


Peter Bindels wrote :

> That's not entirely accurate. UTF-8 is Latin-centric, so that all
> latin texts can be processed in linear time, taking longer for the
> rest.

Huh?
Not really.
In UTF-8, every non-ASCII character, including accented Latin ones,
takes more than one byte.

It can still be processed in linear time, though; it just means you can't
have random access to individual characters.
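
To illustrate, here is a rough sketch of what that looks like in practice. It assumes the input is already valid UTF-8, and the function names are made up for the example; they don't come from any existing library. Counting is a single pass, but finding the k-th character means scanning from the start.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string in a single linear pass.
// Assumption: the input is valid UTF-8 (no validation is done here).
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Returns the byte offset of the k-th code point (0-based), or s.size()
// if the string has k or fewer code points. No direct jump is possible:
// the scan always starts at the beginning.
std::size_t byte_offset_of(const std::string& s, std::size_t k) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            if (seen == k) return i;
            ++seen;
        }
    }
    assert(seen <= k);  // reached only when k is past the last code point
    return s.size();
}
```

For "café" encoded as UTF-8, the string is 5 bytes long but holds 4 code points.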

> UTF-16 is common-centric, in that it works efficiently for all
> common texts in all common scriptures, except for a few. Choosing
> UTF-8 over UTF-16 would make the implementation (and accompanying
> software) slow in all parts of the world that aren't solely using
> Latin characters.

I doubt the overhead is really noticeable.
UTF-16 just makes validation and iteration a little simpler.
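
For comparison, a minimal sketch of UTF-16 iteration: there is only one special case, the surrogate pair. The function name is hypothetical, and the code assumes well-formed input (every high surrogate is followed by a low one).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes the code point starting at index i and advances i past it.
// Assumption: the input is well-formed UTF-16.
char32_t next_code_point(const std::u16string& s, std::size_t& i) {
    assert(i < s.size());
    char32_t u = s[i++];
    // A high surrogate (0xD800..0xDBFF) combines with the following
    // low surrogate to form a code point above U+FFFF.
    if (u >= 0xD800 && u <= 0xDBFF && i < s.size()) {
        char32_t lo = s[i++];
        return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
    }
    return u;  // everything else is a single 16-bit unit
}
```

Iteration is still sequential here, since a code point may take one or two units.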

> That would be most of Europe, Asia, Africa,
> South-America and a number of people in North-America and Australia.
> Forcing them to UTF-32 makes for quite a lot worse memory use than
> could reasonably be expected.

UTF-32 allows random access to code points, but that's rather useless
since you need to iterate over the string anyway to handle combining
characters.
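
A small sketch of the problem, under a loud simplification: it only treats the Combining Diacritical Marks block (U+0300..U+036F) as combining, which is just one of several such ranges. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplification for illustration: only U+0300..U+036F counts as combining.
// Real text segmentation needs the full Unicode rules.
bool is_combining_mark(char32_t c) { return c >= 0x0300 && c <= 0x036F; }

// Counts base characters, folding the combining marks that follow into them.
// Even with fixed-width UTF-32, this needs a pass over the string.
std::size_t count_user_characters(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t c : s) {
        if (!is_combining_mark(c)) ++n;
    }
    return n;
}

// Usage: "e" followed by U+0301 (combining acute accent) displays as one "é".
void example() {
    std::u32string s = U"e\u0301";
    assert(s.size() == 2);                  // two code points
    assert(count_user_characters(s) == 1);  // one user-perceived character
}
```

So s[1] is directly addressable, but it is an accent and not a character a user would recognise on its own.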

