
From: Mathew Robertson (mathew.robertson_at_[hidden])
Date: 2004-10-21 19:18:21


> >> >> - Why would the user want to change the encoding? Especially between
> >> >> UTF-16 and UTF-32?
> >> >
> >> > Well... Different people have different needs. If you are mostly using
> >> > ASCII characters, and require small size, UTF-8 would fit your bill. If
> >> > you need the best general performance on most operations, use UTF-16.
> >> > If you need fast iteration over code points and size doesn't matter,
> >> > use UTF-32.
> >>
> >> Ok, since everybody agreed characters outside 16 bits are very rare,
> >> UTF-32 seems to never be needed. As for UTF-8 vs. UTF-16: yes, the need
> >> for choice seems present. However, UTF-16 string class would be better
> >> than no string class at all, and extra genericity will cost you
> >> development time.
> >
> > <rant>
> > umm... so you're saying that no one will ever need more than 640K RAM?
> > Just because YOU don't need more than 16 bits doesn't mean that I don't need
> > more than 16 bits. </rant>
> >
> > The main question of a Unicode library should _always_ be: can the library
> > represent every character that can be drawn? Things like iterators,
> > algorithms, etc. are nice-to-haves -> the representation of the written
> > language is the first priority; everything else is secondary.
> >
> > Also, the Unicode standard will evolve over time to include more
> > characters from many more character sets that you or I may never use but
> > someone else might; who knows, maybe the ASCII character set will get a
> > 27th character one day... A library shouldn't preclude the use of these
> > new characters just because we thought "no one will ever need more than
> > 16 bits"... So, how about we don't make the same mistakes as we made in the
> > past...
>
> Do you realize that "nobody needs UTF-32" is not the same as "nobody needs
> character which can't be represented in 16 bits"? UTF-16 can represent all
> Unicode characters.

Yes, I do realise... the original statement was "...everybody agreed characters outside 16 bits are very rare, UTF-32 seems to never be needed."
UTF-16 can indeed represent every Unicode character (it uses surrogate pairs for anything above U+FFFF), but that is not what was written.
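
To make the point concrete: UTF-16 reaches code points above U+FFFF by splitting them into a pair of 16-bit surrogate units. The following is only a minimal sketch of that arithmetic; the function name `encode_utf16` is made up for illustration and is not part of any proposed library interface.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (name chosen for illustration only):
// encode one Unicode code point as UTF-16 code units.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);              // upper limit of the Unicode code space
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogate values are not characters

    if (cp < 0x10000) {
        // Fits in a single 16-bit code unit.
        return { static_cast<std::uint16_t>(cp) };
    }

    // Above U+FFFF: subtract 0x10000, leaving a 20-bit value that is
    // split into two 10-bit halves.
    const std::uint32_t v = cp - 0x10000;
    return {
        static_cast<std::uint16_t>(0xD800 + (v >> 10)),   // high surrogate
        static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))  // low surrogate
    };
}
```

For example, U+1D11E (musical symbol G clef) comes out as the two units 0xD834 0xDD1E, so the character is representable, just not in a single 16-bit unit.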

Also, "nobody needs character which can't be represented in 16 bits" in the context of UTF-16 is the same as "nobody needs more than 8 bits" in the context of UTF-8. The same could be said for 4 bits and 2 bits, given an appropriate encoding scheme...

One point that hasn't been mentioned so far is that words on most modern CPUs are 32 bits wide. From a performance POV, word alignment may be enough to justify the increased storage requirements of a 32-bit code unit.
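
As a rough illustration of the storage side of that trade-off, here is a sketch that counts the bytes a sequence of code points would occupy in each encoding form. The names `EncodedSizes` and `encoded_sizes` are hypothetical, the input is assumed to be valid code points, and it says nothing about speed; that would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: storage in bytes for each encoding form.
struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Count the bytes needed to store the given code points in
// UTF-8, UTF-16 and UTF-32 (no BOM, no terminator).
EncodedSizes encoded_sizes(const std::vector<std::uint32_t>& code_points) {
    EncodedSizes s = {0, 0, 0};
    for (std::uint32_t cp : code_points) {
        assert(cp <= 0x10FFFF);  // assumption: caller passes valid code points
        // UTF-8: 1 to 4 bytes depending on the magnitude of the code point.
        s.utf8 += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        // UTF-16: one 16-bit unit, or a surrogate pair above U+FFFF.
        s.utf16 += cp < 0x10000 ? 2 : 4;
        // UTF-32: always one 32-bit unit.
        s.utf32 += 4;
    }
    return s;
}
```

For the four code points 'H', 'i', U+20AC (euro sign) and U+1D11E this gives 9 bytes in UTF-8, 10 in UTF-16 and 16 in UTF-32.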

> > Whatever decision finally gets made will come down to one of two
> > choices: a) a variable-length string format, e.g. UTF-8 or something
> > similar; b) a fixed-width format with so many bits that humans are unlikely
> > to use all the address space at any time in the next 50-100 years, e.g.
> > UTF-32 or similar.
> >
> > FWIW: my personal preference would be to go for a variable-width encoding
> > -> so that we never have to solve this problem again... although this
> > makes concepts like text-reflow quite a bit harder to implement.
>
> What's "text-reflow", BTW?

Text-reflow is the term used to describe what happens when a slab of text needs to be reformatted to fit a specified width.

For example, a word processor (particularly one that uses variable-width font metrics) needs to reflow a paragraph so that it fits within the specified width. If you resize the word processor window, the formatting engine has to reflow the text according to the new window size.
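
A very simplified sketch of the idea, assuming every character occupies exactly one column and one byte (which is precisely what stops being true with variable-width encodings and proportional fonts). The function name `reflow` is made up for illustration; a real formatting engine would measure glyph widths rather than count bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical greedy reflow: break whitespace-separated words into
// lines of at most `width` columns, assuming one column per char.
std::vector<std::string> reflow(const std::string& text, std::size_t width) {
    assert(width > 0);  // a zero-width line cannot hold any text

    std::vector<std::string> lines;
    std::istringstream words(text);
    std::string word;
    std::string line;

    while (words >> word) {
        // Start a new line if appending " word" would exceed the width.
        if (!line.empty() && line.size() + 1 + word.size() > width) {
            lines.push_back(line);
            line.clear();
        }
        if (!line.empty()) {
            line += ' ';
        }
        // A word longer than `width` is not split; it overflows its own line.
        line += word;
    }
    if (!line.empty()) {
        lines.push_back(line);
    }
    return lines;
}
```

Calling `reflow("the quick brown fox jumps", 10)` yields the three lines "the quick", "brown fox" and "jumps"; calling it again with a width of 20 yields "the quick brown fox" and "jumps", which is the resize case described above.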

Mathew


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk