From: Vladimir Prus (ghost_at_[hidden])
Date: 2004-10-21 05:12:02


Mathew Robertson wrote:

>> >> - Why would the user want to change the encoding? Especially between
>> >> UTF-16 and UTF-32?
>> >
>> > Well... Different people have different needs. If you are mostly using
>> > ASCII characters, and require small size, UTF-8 would fit your bill. If
>> > you need the best general performance on most operations, use UTF-16.
>> > If you need fast iteration over code points and size doesn't matter,
>> > use UTF-32.
>>
>> Ok, since everybody agreed characters outside 16 bits are very rare,
>> UTF-32 seems to never be needed. As for UTF-8 vs. UTF-16: yes, the need
>> for choice seems present. However, UTF-16 string class would be better
>> than no string class at all, and extra genericity will cost you
>> development time.
>
> <rant>
> umm... so you're saying that no one will ever need more than 640K RAM?
> Just because YOU don't need more than 16 bits doesn't mean that I don't
> need more than 16 bits. </rant>
>
> The main question for a Unicode library should _always_ be: can the library
> represent every character that can be drawn? Things like iterators,
> algorithms, etc. are nice-to-haves -> the representation of the written
> language is the first priority, everything else is secondary.
>
> Also, the Unicode standard will evolve over time to include more
> characters from many more character sets that you or I may never use but
> someone else might; who knows, maybe the ASCII character set will get a
> 27th character one day... A library shouldn't preclude the use of these
> new characters just because we thought "no one will ever need more than
> 16 bits"... So, how about we don't make the same mistakes we made in the
> past...

Do you realize that "nobody needs UTF-32" is not the same as "nobody needs
characters that can't be represented in 16 bits"? UTF-16 can represent all
Unicode characters.
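
To make that concrete, here is a minimal sketch of how UTF-16 covers code
points above U+FFFF by using surrogate pairs. The function names
encode_utf16 and decode_utf16 are hypothetical, made up for this
illustration; they are not part of any proposed Boost interface, and the
decoder assumes well-formed input.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode scalar value as UTF-16 code units.
// Precondition: cp <= 0x10FFFF and cp is not itself a surrogate.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        // Code points that fit in 16 bits take a single code unit.
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Everything else is split into a 20-bit offset carried by
        // a high surrogate (top 10 bits) and a low surrogate (bottom 10 bits).
        const std::uint32_t offset = cp - 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 + (offset >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF)));
    }
    return out;
}

// Hypothetical helper: decode the one or two code units produced above.
// No validation is done; the input is assumed to be well formed.
std::uint32_t decode_utf16(const std::vector<std::uint16_t>& units) {
    if (units.size() == 1) {
        return units[0];
    }
    const std::uint32_t high = units[0] - 0xD800u;
    const std::uint32_t low = units[1] - 0xDC00u;
    return 0x10000u + (high << 10) + low;
}
```

With these helpers, U+1D11E (a character outside 16 bits) encodes to the two
code units 0xD834 0xDD1E, and the highest code point, U+10FFFF, encodes to
0xDBFF 0xDFFF. So the cost of UTF-16 is variable length, not a limit on
which characters can be represented.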

> Whatever decision finally gets chosen will come down to one of two
> choices: a) a variable-length string format, e.g. UTF-8 or something
> similar; b) a fixed-width format with so many bits that humans are unlikely
> to use all the address space at any time in the next 50/100 years, e.g.
> UTF-32 or similar.
>
> FWIW: my personal preference would be to go for a variable-width encoding
> -> so that we never have to solve this problem again... although this
> makes concepts like text-reflow quite a bit harder to implement.

What's "text-reflow", BTW?

- Volodya

