From: Miro Jurisic (macdev_at_[hidden])
Date: 2004-04-06 14:11:13

In article <077097E85A6BD3119E910800062786A90B3F2D1C_at_[hidden]>,
 Ferdinand Prantl <ferdinand.prantl_at_[hidden]> wrote:

> I am afraid there is no universal solution for all users. The easiest
> solution is based on the native basic_string<>, which is specialized for
> char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit)
> usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would
> require another basic_string<> specialization.

Assuming that basic_string<> is an appropriate abstraction for Unicode strings
is a fallacy; the rest of your post, and some other parts of this thread, seem
to make that assumption.

This assumption is false because basic_string<T> makes performance guarantees
that are incompatible with a useful Unicode string abstraction.

In order to maintain the performance guarantees of basic_string, you have to
treat a Unicode string as a sequence of code points (or, for UTF-8 and UTF-16,
code units), rather than as a sequence of abstract characters.

On the other hand, in order to manipulate a Unicode string without violating
constraints on well-formedness, you have to consider the string as a sequence of
abstract characters (unless, of course, you constrain yourself to string
transformations which operate on code point sequences yet guarantee that strings
remain well-formed; there are few such transformations -- concatenation is one
of them under certain constraints).
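
As a rough sketch of what "under certain constraints" could look like for
concatenation: one plausible constraint (my assumption, not something stated
above) is that the right-hand string must not begin with a combining mark,
since that mark would otherwise attach to the last character of the left-hand
string. The helper names are hypothetical, and the combining-mark test is
deliberately simplified to the U+0300..U+036F block; real code would need the
Unicode character database.

```cpp
#include <string>

// Simplifying assumption: only U+0300..U+036F (Combining Diacritical Marks)
// is treated as combining. This is not a complete test.
bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Hypothetical helper: appends rhs to lhs only when rhs does not start with
// a combining mark. Returns false and leaves lhs untouched otherwise.
bool append_if_safe(std::u32string& lhs, const std::u32string& rhs) {
    if (!rhs.empty() && is_combining_mark(rhs.front())) {
        return false;  // would alter the last abstract character of lhs
    }
    lhs += rhs;  // plain code point concatenation
    return true;
}
```

Appending "e" to "C" succeeds; appending a string that starts with a combining
caron is refused, because it would change what the final character of the
left-hand string is.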

It should be noted that basic_string<ucs4char_t> is as misguided an idea as
basic_string<utf8char_t>, because even in UCS4 an abstract character might
consist of more than one code point; for example, if you consider the string

capital letter C (U+0043); combining caron (U+030C); lowercase letter e (U+0065)

it contains two abstract characters, but three UCS4 code points; therefore,
removing the first character from that string means removing the first two code
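points of three. Removing just the first code point would leave you with a
combining caron that has no base character, followed by a lowercase letter e,
which is not a well-formed Unicode string (in the Unicode Standard's terms, it
begins with a defective combining character sequence).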
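
The example above can be sketched in code. This is only an illustration: it
uses std::u32string (C++11) as a stand-in for basic_string<ucs4char_t>, the
helper names are hypothetical, and the combining-mark test is simplified to
the U+0300..U+036F block rather than consulting the Unicode character database.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F (Combining Diacritical Marks)
// is treated as combining. This is not a complete test.
bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// The example string: capital C, combining caron, lowercase e.
std::u32string example() {
    return {U'C', U'\u030C', U'e'};
}

// Hypothetical helper: number of code points in the first abstract
// character, i.e. the first code point plus any combining marks after it.
std::size_t first_character_length(const std::u32string& s) {
    if (s.empty()) return 0;
    std::size_t n = 1;
    while (n < s.size() && is_combining_mark(s[n])) ++n;
    return n;
}

// Hypothetical helper: removes the first abstract character, not merely
// the first code point.
std::u32string remove_first_character(std::u32string s) {
    s.erase(0, first_character_length(s));
    return s;
}
```

With this sketch, example().size() is 3, first_character_length(example()) is
2, and remove_first_character(example()) leaves just "e". A plain erase(0, 1)
on the same string leaves the combining caron at the front.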

(Yes, I know that this particular string could also be written in a canonically
precomposed form in which there is indeed one code point per abstract character,
but that is not true of all Unicode strings which include combining marks; I am
just too lazy to find out exactly which aren't.)
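
For this particular string, the precomposed form uses U+010C (capital letter C
with caron). A small sketch, again using std::u32string as a stand-in for a
UCS4 basic_string and hypothetical function names, shows that the two
spellings of the same text are different sequences as far as basic_string is
concerned:

```cpp
#include <string>

// Decomposed spelling: base letter followed by a combining caron.
std::u32string decomposed() {
    return {U'C', U'\u030C', U'e'};
}

// Precomposed spelling: U+010C is the capital letter C with caron.
std::u32string precomposed() {
    return {U'\u010C', U'e'};
}

// basic_string compares element by element, so it sees two different
// strings even though both spell the same two abstract characters.
bool equal_as_code_points() {
    return decomposed() == precomposed();
}
```

Here decomposed().size() is 3 and precomposed().size() is 2, and
equal_as_code_points() returns false.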

To summarize:

basic_string<ucs4char_t> solves very few problems compared to
basic_string<utf8char_t>. Do not be fooled into thinking that the complexities
of Unicode can be swept under the UCS4 rug.
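
To make the comparison concrete, here is a minimal sketch that stores the
example string both ways, with std::string holding UTF-8 bytes and
std::u32string holding UCS4 code points as stand-ins for the two hypothetical
basic_string specializations. The function names are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// UTF-8: C is one byte, the combining caron (U+030C) is the two bytes
// 0xCC 0x8C, and e is one byte.
std::size_t utf8_length() {
    return std::string("C" "\xCC\x8C" "e").size();
}

// UCS4: one element per code point.
std::size_t ucs4_length() {
    return std::u32string{U'C', U'\u030C', U'e'}.size();
}
```

utf8_length() is 4 and ucs4_length() is 3; neither is 2, the number of
abstract characters, so size() and operator[] do not work in terms of
characters in either representation.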

basic_string is not the abstraction you are looking for, but it's also the only
one that is readily available in STL/Boost today. It may serve as a good
starting point (questionable, IMNSHO), but it should most definitely not be
treated as the right thing to use for Unicode in the long term.

