Boost :


From: Aristid Breitkreuz (aribrei_at_[hidden])
Date: 2006-09-17 14:01:17

Next message: Robert Ramey: "Re: [boost] [serialization] multiple reg problem"
Previous message: Robert Ramey: "[boost] [MPI] Review comments"
In reply to: Peter Bindels: "Re: [boost] Work that has been done on Unicode"
Next in thread: loufoque: "Re: [boost] Work that has been done on Unicode"

On Sunday, 17.09.2006, at 18:10 +0200, Peter Bindels wrote:
> On 17/09/06, Aristid Breitkreuz <aribrei_at_[hidden]> wrote:
> > On Saturday, 16.09.2006, at 19:55 +0200, loufoque wrote:
> > > Aristid Breitkreuz wrote :
> > [snip]
> > > > That's fine. Do you have plans on which Unicode encoding to use
> > > > internally?
> > >
> > > UTF-8, UTF-16 and UTF-32 would all be available for implementations, and
> > > each one would be able to take or give the other ones for input/output.
> >
> > I guess that every single supported type is extra complexity, right?
> > Would not UTF-8 (for brevity and compatibility) and UTF-32 (because it
> > might be better for some algorithms) suffice?
>
> That's not entirely accurate. UTF-8 is Latin-centric, so that all
> latin texts can be processed in linear time, taking longer for the
> rest.

I thought that for algorithmic processing, UTF-32 is optimal in most
cases?

> UTF-16 is common-centric, in that it works efficiently for all
> common texts in all common scripts, except for a few.

Compared with UTF-8, that is some 90% space overhead for German, French
and other European texts, and 100% for English texts.
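
To make the size comparison concrete, here is a minimal C++11 sketch that
counts how many bytes a text would need in each encoding. The names
utf8_size, utf16_size, EncodedSizes and encoded_sizes are made up for this
illustration and are not part of any proposed Boost interface; the input
is assumed to hold valid Unicode code points.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to encode one code point in UTF-8.
// Assumes cp is a valid Unicode scalar value (at most U+10FFFF).
std::size_t utf8_size(char32_t cp) {
    if (cp < 0x80) return 1;      // ASCII
    if (cp < 0x800) return 2;     // e.g. Latin letters with diacritics
    if (cp < 0x10000) return 3;   // rest of the Basic Multilingual Plane
    return 4;                     // supplementary planes
}

// Bytes needed to encode one code point in UTF-16.
std::size_t utf16_size(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;  // one code unit, or a surrogate pair
}

// Hypothetical result type: total bytes per encoding.
struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Sum the per-code-point sizes over a whole text.
EncodedSizes encoded_sizes(const std::u32string& text) {
    EncodedSizes sizes = {0, 0, 0};
    for (char32_t cp : text) {
        sizes.utf8 += utf8_size(cp);
        sizes.utf16 += utf16_size(cp);
        sizes.utf32 += 4;         // UTF-32 is always 4 bytes per code point
    }
    return sizes;
}
```

For a pure ASCII text such as U"hello" this gives 5, 10 and 20 bytes, which
is the 100% overhead of UTF-16 mentioned above. For text consisting only of
3-byte BMP characters, the comparison between UTF-8 and UTF-16 reverses.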

> Choosing
> UTF-8 over UTF-16 would make the implementation (and accompanying
> software) slow in all parts of the world that aren't solely using
> Latin characters.

Are you talking about memory overhead? AFAIK UTF-8 is quite good for
that. It might be slightly suboptimal for some Asian scripts, but I'm
not sure about that. UTF-8 is guaranteed never to use more than 4 bytes
per code point.
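
The 4-byte limit can be seen from a small encoder. This is only a sketch:
encode_utf8 is a hypothetical helper name, it handles one code point at a
time, and it does not reject surrogate values (U+D800 to U+DFFF), which a
real implementation would have to do.

```cpp
#include <string>

// Encode a single code point as UTF-8.
// Returns an empty string for values above U+10FFFF.
// Note: surrogate code points are NOT rejected in this sketch.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The last branch is the longest one, so even the highest code point,
U+10FFFF, fits in 4 bytes.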

> That would be most of Europe, Asia, Africa,
> South-America and a number of people in North-America and Australia.

Yes, those people (I am one of them) don't use solely Latin (= 7-bit
ASCII?) characters. Still, I'd usually prefer UTF-8.

> Forcing them to UTF-32 makes for quite a lot worse memory use than
> could reasonably be expected. I see quite a lot of use for the UTF-16
> case, perhaps even more than the UTF-8 one.

UTF-32 is _always_ bad on memory, because Unicode will never use more
than 21 bits per code point, I think. But UTF-32 is great for some
algorithms. (OK, maybe Unicode still has some traps that hinder
efficient UTF-32 algorithms, who knows?)
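
One example of an algorithm where the fixed width helps is finding the
n-th code point. The sketch below contrasts the two cases; the function
names utf8_offset_of and utf32_at are invented for illustration, and the
UTF-8 input is assumed to be well-formed. Note that a code point is not
necessarily what a user sees as one character (combining marks exist), so
this shows only the indexing difference.

```cpp
#include <cstddef>
#include <string>

// The highest code point, U+10FFFF, fits in 21 bits.
static_assert(0x10FFFF < (1 << 21), "Unicode code points fit in 21 bits");

// Byte offset of the n-th code point (0-based) in a well-formed UTF-8
// string, or std::string::npos if there are fewer code points.
// This has to scan from the start: linear time.
std::size_t utf8_offset_of(const std::string& utf8, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        // Continuation bytes look like 10xxxxxx; any other byte
        // starts a new code point.
        bool starts_code_point =
            (static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80;
        if (starts_code_point) {
            if (seen == n) return i;
            ++seen;
        }
    }
    return std::string::npos;
}

// With UTF-32 the n-th code point is a plain array access: constant time.
// The caller must ensure n < utf32.size().
char32_t utf32_at(const std::u32string& utf32, std::size_t n) {
    return utf32[n];
}
```

UTF-16 would need a similar scan to the UTF-8 one as soon as surrogate
pairs can occur in the text.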



Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk