
From: Miro Jurisic (macdev_at_[hidden])
Date: 2004-04-06 14:11:13


In article <077097E85A6BD3119E910800062786A90B3F2D1C_at_[hidden]>,
 Ferdinand Prantl <ferdinand.prantl_at_[hidden]> wrote:

> I am afraid there is no universal solution for all users. The easiest
> solution is based on the native basic_string<>, which is specialized for
> char (8-bit) to support ASCII/ANSI encodings and for wchar_t (16-bit)
> usually used for UCS-2 encoded strings. UCS-4 (32-bit) encoding would
> require another basic_string<> specialization.

Assuming that basic_string<> is an appropriate abstraction for Unicode strings
is a fallacy; the rest of your post, and some other parts of this thread, seem
to make that assumption.

This assumption is false because basic_string<T> has performance guarantees
(such as constant-time indexed access to its elements) which make it
incompatible with a useful Unicode string abstraction.

In order to maintain the performance guarantees of basic_string, you have to
treat a Unicode string as a sequence of code points (or, for UTF-8 and UTF-16,
of code units), rather than as a sequence of abstract characters.
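
A minimal sketch of that mismatch, assuming the string holds valid UTF-8 in a
plain std::string: size() reports stored elements (bytes), while even the
number of code points can only be found by scanning the whole string. The
function name count_code_points is made up for illustration and is not part of
any library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 encoded std::string.
// Assumes the input is valid UTF-8. Every byte that is not a continuation
// byte (bit pattern 10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or ASCII byte: one more code point
        }
    }
    return count;  // linear in the length of the string
}
```

For the three-code-point sequence "C", U+030C, "e" encoded as the bytes
43 CC 8C 65, size() is 4 and count_code_points returns 3; neither number is the
count of abstract characters, which is 2.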

On the other hand, in order to manipulate a Unicode string without violating
constraints on well-formedness, you have to consider the string as a sequence of
abstract characters (unless, of course, you constrain yourself to string
transformations which operate on code point sequences yet guarantee that strings
remain well-formed; there are few such transformations -- concatenation is one
of them under certain constraints).
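
As a rough sketch of one such constraint (my assumption about what it would
look like, not a complete rule): appending a string that begins with a
combining mark changes the last abstract character of the left-hand string, so
a check along these lines would be needed first. Both function names are
hypothetical, and the range test covers only the Combining Diacritical Marks
block, not every combining code point in Unicode.

```cpp
#include <string>

// Hypothetical, deliberately incomplete test: true only for code points in
// the Combining Diacritical Marks block (U+0300..U+036F).
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical check: true if appending `right` to `left` leaves the
// abstract characters of `left` unchanged, i.e. `right` does not start
// with a mark that would attach to the last character of `left`.
bool concatenation_keeps_boundaries(const std::u32string& left,
                                    const std::u32string& right) {
    return left.empty() || right.empty() || !is_combining_mark(right.front());
}
```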

It should be noted that basic_string<ucs4char_t> is as misguided an idea as
basic_string<utf8char_t>, because even in UCS4 an abstract character might
consist of more than one code point; for example, if you consider the string

capital letter C; combining caron; lowercase letter e

it contains two abstract characters, but three UCS4 code points; therefore,
removing the first character from that string means removing the first two code
points of three. Removing just the first code point would leave you with a
combining caron followed by a lowercase letter e, which is not a well-formed
Unicode string.
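
The example can be sketched in code, assuming a C++11 compiler with
std::u32string standing in for basic_string<ucs4char_t>. The helper names are
invented for illustration, and the combining-mark test is a simplification
that only knows about U+0300..U+036F; a real implementation would need the
Unicode character database.

```cpp
#include <cstddef>
#include <string>

// Hypothetical, deliberately incomplete test: true only for code points in
// the Combining Diacritical Marks block (U+0300..U+036F).
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Number of code points making up the first abstract character: one base
// code point plus any combining marks that follow it.
std::size_t first_character_length(const std::u32string& s) {
    if (s.empty()) {
        return 0;
    }
    std::size_t length = 1;
    while (length < s.size() && is_combining_mark(s[length])) {
        ++length;
    }
    return length;
}

// Removes the first abstract character rather than the first code point.
std::u32string remove_first_character(std::u32string s) {
    s.erase(0, first_character_length(s));
    return s;
}
```

With s holding "C", U+030C, "e", s.size() is 3 and first_character_length(s)
is 2. remove_first_character(s) yields "e", whereas s.substr(1) yields a string
that starts with the stray combining caron.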

(Yes, I know that this particular string could also be written in a canonically
precomposed form in which there is indeed one code point per abstract character,
but that is not true of all Unicode strings which include combining marks; I am
just too lazy to find out exactly which aren't.)

To summarize:

basic_string<ucs4char_t> solves very few problems compared to
basic_string<utf8char_t>. Do not be fooled into thinking that the complexities
of Unicode can be swept under the UCS4 rug.

basic_string is not the abstraction you are looking for, but it's also the only
one that is readily available in STL/boost today. It may serve as a good
starting point (questionable, IMNSHO), but it should most definitely not be
treated as the right thing to use for Unicode in the long term.

meeroh

-- 
If this message helped you, consider buying an item
from my wish list: <http://web.meeroh.org/wishlist>
