From: Vladimir Prus (ghost_at_[hidden])
Date: 2004-04-07 02:57:53


Miro Jurisic wrote:

> On the other hand, in order to manipulate a Unicode string without
> violating constraints on well-formedness, you have to consider the string
> as a sequence of abstract characters (unless, of course, you constrain
> yourself to string transformations which operate on code point sequences
> yet guarantee that strings remain well-formed; there are few such
> transformations -- concatenation is one of them under certain constraints).

[snip]

> capital letter C; combining caron; lowercase letter e
>
> it contains two abstract characters, but three UCS4 code points; therefore,
> removing the first character from that string means removing the first two
> code points of three. Removing just the first code point would leave you
> with a combining caron followed by a lowercase letter e, which is not a
> well-formed Unicode string.

Hi Miro,

So the point is that when a string is treated as a container of code points,
even searching for and removing a character or substring might produce an
invalid string? E.g., when looking for the string 'foo', you could in theory
find 'foo' followed by a combining character, and removing just 'foo' would
leave an invalid string?
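A minimal sketch of the 'foo' case, assuming UCS4 code points stored in a std::u32string. The names is_combining_mark and erase_first_if_safe are hypothetical, and the combining test is a deliberate simplification that only covers the Combining Diacritical Marks block (U+0300 to U+036F); a real implementation would consult the Unicode character database.

```cpp
#include <cassert>
#include <string>

// Simplified assumption: only U+0300..U+036F count as combining marks here.
inline bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: erase the first occurrence of `needle` whose last
// code point is not followed by a combining mark. Matches that would strand
// a combining mark are skipped. Returns true if something was erased.
bool erase_first_if_safe(std::u32string& text, const std::u32string& needle) {
    assert(!needle.empty());
    std::u32string::size_type pos = text.find(needle);
    while (pos != std::u32string::npos) {
        const std::u32string::size_type end = pos + needle.size();
        const bool mark_follows =
            end < text.size() && is_combining_mark(text[end]);
        if (!mark_follows) {
            text.erase(pos, needle.size());
            return true;
        }
        pos = text.find(needle, pos + 1);  // try the next candidate match
    }
    return false;
}
```

With the input 'f', 'o', 'o', U+030C the plain find() succeeds at position 0, but the helper refuses to erase because the caron belongs to the final 'o'.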

> basic_string is not the abstraction you are looking for, but it's also the
> only one that is readily available in STL/boost today. It may serve as a
> good starting point (questionable, IMNSHO), but it should most definitely
> not be treated as the right thing to use for Unicode in the long term.

I wonder what the right abstraction is, then. Is it necessary to have a class
that represents an abstract character together with all of its combining
characters?
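One possible shape for such a class, purely as a sketch: group each base code point with the combining marks that follow it, then operate on the groups. The names abstract_character, to_abstract_characters, and drop_first_character are hypothetical and not part of the STL or Boost, and the combining test again assumes only the range U+0300 to U+036F.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified assumption: only U+0300..U+036F count as combining marks here.
inline bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical type: one base code point plus its trailing combining marks.
struct abstract_character {
    std::u32string code_points;
};

// Split a code point sequence into abstract characters. A combining mark at
// the very start has no base, so it becomes a group of its own.
std::vector<abstract_character> to_abstract_characters(const std::u32string& text) {
    std::vector<abstract_character> out;
    for (char32_t cp : text) {
        if (out.empty() || !is_combining_mark(cp)) {
            out.push_back(abstract_character{std::u32string(1, cp)});
        } else {
            out.back().code_points.push_back(cp);
        }
    }
    return out;
}

// Remove the first abstract character, however many code points it spans.
std::u32string drop_first_character(const std::u32string& text) {
    const std::vector<abstract_character> chars = to_abstract_characters(text);
    assert(!chars.empty());
    std::u32string result;
    for (std::size_t i = 1; i < chars.size(); ++i) {
        result += chars[i].code_points;
    }
    return result;
}
```

For Miro's example (capital C, combining caron, lowercase e) this yields two abstract characters, and dropping the first one removes two code points and leaves just the 'e'.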

- Volodya


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk