Boost :

From: John Maddock (john_at_[hidden])
Date: 2004-04-14 05:59:26

> > > - The standard facets (and the locale class itself, in that it is a
> > > functor for comparing basic_strings) are tied to facilities such as
> > > std::basic_string and std::ios_base which are not suitable for
> > > Unicode support.
> >
> > Why not? Once the locale facets are provided, the std iostreams will
> > work; that was the whole point of templating them in the first place.
> I have already gone over this in other posts, but, in short, basic_string
> makes performance guarantees that are at odds with Unicode strings.

basic_string is a sequence of code points, no more, no less; all of the
performance guarantees for basic_string can be met as such.

> > However I think we're getting ahead of ourselves here: I think a
> > library should be handled in stages:
> >
> > 1) define the data types for 8/16/32 bit Unicode characters.
> The fact that you believe this is a reasonable first step leads me to believe
> that you have not given much thought to the fact that even if you use a 32-bit
> Unicode encoding, a character can take up more than 32 bits (and likewise for
> 16-bit and 8-bit encodings). Unicode characters are not fixed-width data in any
> encoding.

Well, it is the same first step that ICU takes. There is also a proposal
before the C language committee to introduce such data types (they're called
char16_t and char32_t), and C++ is likely to follow suit (see

I'm talking about code points (and sequences thereof), not characters or
glyphs, which as you say can consist of multiple code points.
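
To make the distinction concrete, here is a minimal sketch. It assumes the
proposed char32_t type and a matching std::u32string (as later standardised in
C++11). The helper names are hypothetical, and treating only U+0300..U+036F as
combining marks is a deliberate simplification; real "character" boundaries
need the full Unicode segmentation rules.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true only for the Combining Diacritical Marks block.
// Real code would consult the Unicode character database instead.
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Crude "character" count: a base code point plus any combining marks that
// follow it is counted once. s.size() remains the code point count.
std::size_t approx_character_count(const std::u32string& s) {
    std::size_t count = 0;
    for (char32_t cp : s) {
        assert(cp <= 0x10FFFF);  // debug check: code points end at U+10FFFF
        if (!is_combining_mark(cp)) ++count;
    }
    return count;
}
```

For the sequence U+0065 U+0301 ("e" followed by a combining acute accent),
size() reports two code points while the sketch reports one character.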

I would handle "characters" and "glyphs" as iterator adapters sitting on top
of sequences of code points. For code points, basic_string is as good a
container as any (as are vector and deque and anything else you care to

> > 2) define iterator adapters to convert a sequence of one Unicode
> > type to another.
> This is also not as easy as you seem to believe that it is, because even within
> one encoding many strings can have multiple representations.

I'm not talking about normalisation / composition here, just conversion
between encodings. ICU does this already, as do many other libraries.
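
As a rough illustration of such a conversion adapter, the sketch below wraps an
iterator over UTF-8 code units and yields UTF-32 code points. The class name
utf8_to_utf32_iterator and the helper to_utf32 are hypothetical, not part of
any existing library, and the code assumes C++11 and well-formed input: it
performs no validation beyond a debug assertion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical adapter: wraps an iterator over UTF-8 code units and yields
// UTF-32 code points. It assumes well-formed input and does no recovery.
template <class Iter>
class utf8_to_utf32_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    explicit utf8_to_utf32_iterator(Iter pos) : pos_(pos) {}

    // Decode the code point whose first code unit is at the current position.
    char32_t operator*() const {
        Iter p = pos_;
        const std::uint8_t lead = static_cast<std::uint8_t>(*p);
        const int len = sequence_length(lead);
        // A lead byte carries 7, 5, 4 or 3 payload bits for lengths 1 to 4.
        char32_t cp = static_cast<char32_t>(
            (len == 1) ? lead : (lead & (0x7Fu >> len)));
        for (int i = 1; i < len; ++i) {
            const std::uint8_t next = static_cast<std::uint8_t>(*++p);
            assert((next & 0xC0u) == 0x80u);  // debug check: continuation byte
            cp = static_cast<char32_t>((cp << 6) | (next & 0x3Fu));
        }
        return cp;
    }

    // Step over every code unit of the current code point.
    utf8_to_utf32_iterator& operator++() {
        std::advance(pos_, sequence_length(static_cast<std::uint8_t>(*pos_)));
        return *this;
    }

    friend bool operator==(const utf8_to_utf32_iterator& a,
                           const utf8_to_utf32_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_to_utf32_iterator& a,
                           const utf8_to_utf32_iterator& b) {
        return !(a == b);
    }

private:
    // Number of code units in the sequence introduced by this lead byte.
    static int sequence_length(std::uint8_t lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x06) return 2;
        if ((lead >> 4) == 0x0E) return 3;
        return 4;
    }

    Iter pos_;
};

// Convenience wrapper: copy the decoded code points into a UTF-32 string.
template <class Iter>
std::u32string to_utf32(Iter first, Iter last) {
    std::u32string out;
    for (utf8_to_utf32_iterator<Iter> it(first), end(last); it != end; ++it) {
        out.push_back(*it);
    }
    return out;
}
```

Because the adapter only needs to read and advance the underlying iterator,
the same shape would work over basic_string, vector or deque. A production
version would also have to decide what to do with truncated or overlong
sequences.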

Iterator adapters for normalisation / composition / compression would also
be useful additions.

Likewise adapters for iterating "characters" and "glyphs".

> > 3) define char_traits specialisations (as necessary) in order to get
> > basic_string working with Unicode character sequences, typedef the
> > appropriate string types:
> >
> > typedef basic_string<utf8_t> utf8_string; // etc
> This is not a good idea. If you do this, you will produce a basic_string that
> can violate well-formedness of Unicode strings when you use any mutation
> algorithm other than concatenation, or you will violate the performance
> guarantees of basic_string.

Working on sequences of code points always requires care: clearly one could
erase a low surrogate and leave a high surrogate "orphaned" behind, for
example. One would need to make it clear in the documentation that potential
problems like this can occur.
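
The surrogate hazard can be sketched as follows, assuming C++11's char16_t and
std::u16string. The function names are hypothetical; erase_code_point is just
one possible "careful" alternative to erasing a single code unit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// True if every high surrogate is immediately followed by a low surrogate
// and no low surrogate appears on its own.
bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i])) {
            if (i + 1 == s.size() || !is_low_surrogate(s[i + 1])) return false;
            ++i;  // skip the paired low surrogate
        } else if (is_low_surrogate(s[i])) {
            return false;
        }
    }
    return true;
}

// Erase the whole code point that starts at index i (one or two code units).
// Precondition: i is in range and does not point into the middle of a pair.
void erase_code_point(std::u16string& s, std::size_t i) {
    assert(i < s.size() && !is_low_surrogate(s[i]));
    const bool pair = is_high_surrogate(s[i]) && i + 1 < s.size() &&
                      is_low_surrogate(s[i + 1]);
    s.erase(i, pair ? 2 : 1);
}
```

Given the code units 'a', 0xD83D, 0xDE00, 'b' (where the middle pair encodes
U+1F600), a plain erase(2, 1) leaves 0xD83D orphaned and the check fails,
whereas erase_code_point(s, 1) removes both units and the string stays
well formed.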

> > 7) Anything I've forgotten :-)
> I think you have forgotten to read and understand the complexity of Unicode (or
> any of the books that discuss the spec less tersely, such as Unicode
> Demystified), because I think that some of the suggestions you made here are
> incompatible with how Unicode actually works. Please correct me if I am wrong --
> I would love to be wrong :-)

Well sometimes I'm wrong, and sometimes I'm right ;-)

Unicode is such a large and complex issue that it's actually pretty hard to
keep even a small fraction of the issues in one's mind at a time, hence my
suggestion to split the issue up into a series of steps.

> > The main goal would be to define a good clean interface, the
> > could be:
> We can't define a good clean interface until we understand the problems.


