From: Aristid Breitkreuz (aribrei_at_[hidden])
Date: 2006-09-17 14:01:17


On Sunday, 17.09.2006, at 18:10 +0200, Peter Bindels wrote:
> On 17/09/06, Aristid Breitkreuz <aribrei_at_[hidden]> wrote:
> > On Saturday, 16.09.2006, at 19:55 +0200, loufoque wrote:
> > > Aristid Breitkreuz wrote:
> > [snip]
> > > > That's fine. Do you have plans on which Unicode encoding to use
> > > > internally?
> > >
> > > UTF-8, UTF-16 and UTF-32 would all be available for implementations, and
> > > each one would be able to take or give the other ones for input/output.
> >
> > I guess that every single supported type is extra complexity, right?
> > Would not UTF-8 (for brevity and compatibility) and UTF-32 (because it
> > might be better for some algorithms) suffice?
>
> That's not entirely accurate. UTF-8 is Latin-centric, so that all
> Latin texts can be processed in linear time, taking longer for the
> rest.

I thought that for algorithmic processing, UTF-32 is optimal in most
cases?
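
A quick illustration of why (the function names here are just mine, not
from any proposed interface, and both assume valid input): with UTF-32,
the i-th code point is a plain array access, while UTF-8 has to scan
past a variable number of continuation bytes first:

    #include <cstddef>
    #include <string>
    #include <vector>
    #include <boost/cstdint.hpp>

    // UTF-32: the i-th code point is simply the i-th element. O(1).
    boost::uint32_t code_point_at(const std::vector<boost::uint32_t>& utf32,
                                  std::size_t i)
    {
        return utf32[i];
    }

    // UTF-8: finding the byte offset of the i-th code point means
    // scanning from the start. O(i).
    std::size_t byte_offset_of(const std::string& utf8, std::size_t i)
    {
        std::size_t off = 0;
        for (; i > 0; --i) {
            ++off; // step past the lead byte
            while (off < utf8.size()
                   && (static_cast<unsigned char>(utf8[off]) & 0xC0) == 0x80)
                ++off; // skip 10xxxxxx continuation bytes
        }
        return off;
    }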

> UTF-16 is common-centric, in that it works efficiently for all
> common texts in all common scripts, except for a few.

That is some 90% space overhead, relative to UTF-8, for German / French /
... (European, if you want) texts, and a full 100% for English texts.
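
Back-of-the-envelope, assuming about one accented character in twenty
(accented Latin-1 letters cost 2 bytes in either encoding, ASCII costs
1 byte in UTF-8 but 2 in UTF-16):

    UTF-8:  0.95 x 1 byte + 0.05 x 2 bytes = 1.05 bytes per character
    UTF-16: 1.00 x 2 bytes                 = 2.00 bytes per character

    overhead: 2.00 / 1.05 - 1 = ~90%  (exactly 100% for pure ASCII)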

> Choosing
> UTF-8 over UTF-16 would make the implementation (and accompanying
> software) slow in all parts of the world that aren't solely using
> Latin characters.

Are you talking about memory overhead? AFAIK UTF-8 is quite good for
that. It might be slightly suboptimal for some Asian scripts (CJK
characters take 3 bytes in UTF-8 but only 2 in UTF-16), but I'm not
sure how much that matters. It is guaranteed that UTF-8 never consumes
more than 4 bytes per code point.
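
The length rule is easy to state; a sketch (the table is the standard
one from RFC 3629, the function itself is just mine for illustration):

    #include <cstddef>
    #include <boost/cstdint.hpp>

    // Bytes needed to encode one code point in UTF-8. RFC 3629 caps
    // the range at U+10FFFF, so the answer is never more than 4.
    std::size_t utf8_length(boost::uint32_t cp)
    {
        if (cp <= 0x7F)   return 1; // ASCII
        if (cp <= 0x7FF)  return 2; // most European scripts, accents
        if (cp <= 0xFFFF) return 3; // rest of the BMP, including CJK
        return 4;                   // supplementary planes
    }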

> That would be most of Europe, Asia, Africa,
> South-America and a number of people in North-America and Australia.

Yes, those people (I am one of them) don't use only Latin (= 7-bit
ASCII?) characters. Still, I'd usually prefer UTF-8.

> Forcing them to UTF-32 makes for considerably worse memory use than
> could reasonably be expected. I see quite a lot of use for the UTF-16
> case, perhaps even more than the UTF-8 one.

UTF-32 is _always_ bad on memory, because Unicode is guaranteed to
never use more than 21 bits per code point. But UTF-32 is great for
some algorithms, since every code point is a single fixed-width unit.
(OK, maybe Unicode still has some traps, such as combining characters,
that hinder efficient UTF-32 algorithms, who knows?)
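
The 21-bit bound is easy to check (using Boost's static assert here
just for illustration): the highest code point is U+10FFFF, which fits
in 21 bits, so at least 11 of every 32 bits in UTF-32 are always zero,
and 24 of 32 for ASCII-only text:

    #include <boost/static_assert.hpp>

    // U+10FFFF (1,114,111) fits in 21 bits (2^21 = 2,097,152).
    BOOST_STATIC_ASSERT(0x10FFFFUL < (1UL << 21));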


