From: Graham (Graham_at_[hidden])
Date: 2008-03-10 16:10:29
>We can implement UTF-8's and UTF-16's skip_forward by looking at the
>current byte. But does that work with all encodings? I think it doesn't
>work for shift encodings, unless you're willing to come to a stop on a
>shift character. I'm not: there's a rule for some shift encodings that
>they *must* end in the initial shift state, which means that there's a
>good chance that a shift character is the last thing in the string.
>This would mean, however, that if you increment an iterator that points to
>the last real character, it must scan past the shift character or it
>won't compare equal to the end iterator. Unless you're willing to scan
>past the shift in the equality test, another thing I wouldn't do.
>
>Seems to me that shift encodings are a lot more pain than they're worth.
>I really have to wonder why anyone would ever have come up with them.

Sebastian,

As Unicode characters outside plane zero (the Basic Multilingual Plane)
need up to 32 bits to encode [yes really], four bytes in UTF-8 or a
surrogate pair in UTF-16, one 'character' can be long in UTF-8/16
encoding. It is even worse if you start looking at conceptual characters
[graphemes], where three or more code points can easily make up a single
conceptual character.
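
To make the "look at the current byte" idea concrete, here is a minimal
sketch in C++. It is not taken from the library mentioned in this post:
the names utf8_sequence_length, utf16_sequence_length, skip_forward and
count_code_points are made up for illustration, and the code assumes
well-formed input and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence whose first byte is `lead`.
// Returns 0 for a continuation byte (10xxxxxx) or a byte that cannot
// start a sequence. Overlong and out-of-range forms are not checked.
inline std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Length in 16-bit units of the UTF-16 sequence starting with `unit`:
// a high surrogate announces a two-unit pair, anything else stands alone.
inline std::size_t utf16_sequence_length(char16_t unit) {
    return (unit >= 0xD800 && unit <= 0xDBFF) ? 2 : 1;
}

// Hypothetical skip_forward for UTF-8: returns the index just past the
// code point that starts at `pos`. Assumes `pos` is on a lead byte.
inline std::size_t skip_forward(const std::string& s, std::size_t pos) {
    assert(pos < s.size());
    const std::size_t n =
        utf8_sequence_length(static_cast<unsigned char>(s[pos]));
    assert(n != 0 && pos + n <= s.size());
    return pos + n;
}

// Counts code points (not graphemes) in a well-formed UTF-8 string.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t pos = 0; pos < s.size(); pos = skip_forward(s, pos)) {
        ++count;
    }
    return count;
}
```

With this sketch, the string "e\xCC\x81" (U+0065 followed by the
combining acute accent U+0301) is three bytes and two code points, yet a
reader sees one conceptual character. Stepping by grapheme needs the
Unicode property data, which a byte-level skip like this cannot supply.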

The only way I have found of handling this is to base the string
functions on a proper Unicode character support library that follows the
Unicode spec. That means you need character movement support, grapheme
support, and sorting (collation) support.

As I said to Phil, Rogier and I completed a Unicode character library
for release under Boost, but we never submitted it: we had intended to
release it together with a string library built on it, and never had
time to do the second part of the work.

Yours,
Graham