From: Mathew Robertson (mathew.robertson_at_[hidden])
Date: 2004-10-21 04:46:10


> >> - Why would the user want to change the encoding? Especially between
> >> UTF-16 and UTF-32?
> >
> > Well... Different people have different needs. If you are mostly using
> > ASCII characters, and require small size, UTF-8 would fit your bill. If
> > you need the best general performance on most operations, use UTF-16. If
> > you need fast iteration over code points and size doesn't matter, use
> > UTF-32.
>
> Ok, since everybody agreed characters outside 16 bits are very rare, UTF-32
> seems to never be needed. As for UTF-8 vs. UTF-16: yes, the need for choice
> seems present. However, UTF-16 string class would be better than no string
> class at all, and extra genericity will cost you development time.

<rant>
umm... so you're saying that no one will ever need more than 640K of RAM?
Just because YOU don't need more than 16 bits doesn't mean that I don't need more than 16 bits.
</rant>

The main question for a Unicode library should _always_ be: can the library represent every character that can be written? Things like iterators, algorithms, etc. are nice-to-haves -> the representation of the written language is the first priority; everything else is secondary.

Also, the Unicode standard will evolve over time to include more characters from many more character sets that you or I may never use but someone else might; who knows, maybe the ASCII character set will get a 27th letter one day... A library shouldn't preclude the use of these new characters just because we thought "no one will ever need more than 16 bits"... So, how about we don't make the same mistakes we made in the past...

Whatever decision finally gets made will come down to one of two choices:
a) a variable-length string format, e.g. UTF-8, or something similar
b) a fixed-width format with so many bits that humans are unlikely to use all the address space at any time in the next 50/100 years, e.g. UTF-32, or similar
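
To make the difference between the two choices concrete, here is a rough sketch of what happens to a code point above 16 bits in each encoding form. It assumes a C++11 compiler (for char32_t and char16_t), and encode_utf8 / encode_utf16 are hypothetical helper names made up for this illustration, not part of any Boost or proposed library API.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 (1 to 4 bytes).
std::string encode_utf8(char32_t cp) {
    // Precondition: a valid scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: encode one code point as UTF-16 (1 or 2 code units).
std::u16string encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        // Anything above 16 bits becomes a surrogate pair.
        char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 | (v >> 10));
        out += static_cast<char16_t>(0xDC00 | (v & 0x3FF));
    }
    return out;
}
```

With these helpers, U+0041 takes one byte in UTF-8 and one unit in UTF-16, while U+1D11E (a musical symbol outside the 16-bit range) takes four bytes in UTF-8 and two units in UTF-16. In UTF-32 every code point is exactly one 32-bit unit, so only that form is truly fixed width.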

FWIW: my personal preference would be to go for a variable-width encoding -> so that we never have to solve this problem again... although this makes concepts like text reflow quite a bit harder to implement.
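
As an illustration of where the extra difficulty comes from, here is a small sketch, assuming well-formed UTF-8 input and a C++11 compiler. The names count_code_points and byte_offset_of are hypothetical, invented for this example. With a variable-width encoding, even "how many characters are there" and "where does character N start" need a scan over the bytes, which is the kind of work a reflow routine would have to do constantly.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is
// not a continuation byte (10xxxxxx). Assumes well-formed input.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Byte offset at which the code point with the given zero-based index
// starts. Returns utf8.size() if the index is past the end.
// This is a linear scan; a fixed-width encoding would just multiply.
std::size_t byte_offset_of(const std::string& utf8, std::size_t index) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) {
            if (seen == index) {
                return i;
            }
            ++seen;
        }
    }
    return utf8.size();
}
```

For the 7-byte string made of "a", U+00E9 and U+1D11E, count_code_points gives 3 and the third code point starts at byte offset 3. Note this only counts code points; combining sequences and line-breaking rules are a separate problem that exists in every encoding.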

regards,
Mathew Robertson

