Boost :

From: Graham (Graham_at_[hidden])
Date: 2008-03-10 16:10:29


>We can implement UTF-8's and UTF-16's skip_forward by looking at the
>current byte. But does that work with all encodings? I think it doesn't
>work for shift encodings, unless you're willing to come to a stop on a
>shift character. I'm not: there's a rule for some shift encodings that
>they *must* end in the initial shift state, which means that there's a
>good chance that a shift character is the last thing in the string. This
>would mean, however, that if you increment an iterator that points to
>the last real character, it must scan past the shift character or it
>won't compare equal to the end iterator. Unless you're willing to scan
>past the shift in the equality test, another thing I wouldn't do.
>
>Seems to me that shift encodings are a lot more pain than they're worth.
>I really have to wonder why anyone would ever have come up with them.
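
For reference, the "look at the current byte" approach from the quoted text can be sketched roughly as below. This is only an illustration: the names utf8_sequence_length, utf16_sequence_length and skip_forward_utf8 are hypothetical, not taken from any existing library, and the code assumes it is handed a pointer to the start of a sequence. It does not fully validate the input (overlong forms and out-of-range lead bytes are not rejected).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of bytes in the UTF-8 sequence introduced by
// `lead`. Returns 0 if `lead` is a continuation byte or cannot start a
// sequence. No check is made for overlong or out-of-range encodings.
inline std::size_t utf8_sequence_length(std::uint8_t lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation or invalid byte
}

// Hypothetical helper: number of 16-bit code units in the UTF-16 sequence
// introduced by `unit`. A high surrogate announces a two-unit pair.
inline std::size_t utf16_sequence_length(std::uint16_t unit) {
    return (unit >= 0xD800 && unit <= 0xDBFF) ? 2 : 1;
}

// Hypothetical skip_forward for UTF-8: advance past one encoded code point
// by inspecting only the byte at `p`. Never moves beyond `end`.
inline const std::uint8_t* skip_forward_utf8(const std::uint8_t* p,
                                             const std::uint8_t* end) {
    assert(p != end);
    std::size_t n = utf8_sequence_length(*p);
    if (n == 0) n = 1;  // malformed input: resynchronise one byte at a time
    const std::size_t remaining = static_cast<std::size_t>(end - p);
    return remaining < n ? end : p + n;
}
```

Nothing here carries state between calls, which is exactly what a shift encoding would break: the length of the next character would depend on shift characters seen earlier, not just on the current byte.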

 

Sebastian,

As Unicode characters that are not in plane zero (the Basic Multilingual
Plane) need a full 32 bits to encode them [yes really], whether as four
bytes in UTF-8 or as a surrogate pair in UTF-16, one 'character' can be
very long in UTF-8/16 encoding. It is even worse if you start looking at
conceptual characters [graphemes], where you can easily have three
characters make up a conceptual character.
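
To put rough numbers on this, here is a small sketch that counts the code units needed for a code point, and for a run of code points forming one conceptual character. The function names are made up for illustration, and the code assumes its input consists of valid Unicode scalar values. For a single code point outside plane zero it gives four UTF-8 bytes or two UTF-16 code units; the totals only grow beyond that once combining marks are added.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: UTF-8 bytes needed for one Unicode scalar value.
inline std::size_t utf8_encoded_length(char32_t cp) {
    // Assumes a valid scalar value: in range and not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;  // anything outside plane zero
}

// Hypothetical helper: UTF-16 code units needed for one scalar value.
inline std::size_t utf16_encoded_length(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    return cp < 0x10000 ? 1 : 2;  // surrogate pair outside plane zero
}

// Total UTF-8 bytes for a run of `count` code points, for example the
// code points that together make up one grapheme.
inline std::size_t utf8_encoded_length(const char32_t* cps, std::size_t count) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < count; ++i) total += utf8_encoded_length(cps[i]);
    return total;
}

// Total UTF-16 code units for a run of `count` code points.
inline std::size_t utf16_encoded_length(const char32_t* cps, std::size_t count) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < count; ++i) total += utf16_encoded_length(cps[i]);
    return total;
}
```

For instance, the three code points U+0071 (q), U+0307 (combining dot above) and U+0323 (combining dot below) display as one conceptual character, so any cursor movement built on these counts alone would stop twice in the middle of it.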

 

The only way I have found of handling this is to base the string
functions on a proper Unicode character support library according to the
Unicode spec. This means that you need character movement support,
grapheme support, and sorting support.

As I said to Phil, Rogier and I completed a Unicode character library
for release under Boost, but never submitted it, as we had intended to
release it together with a string library built on it and never had time
to do the second part of the work.

Yours,

Graham


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk