Boost :

From: John Maddock (john_at_[hidden])
Date: 2004-04-14 05:59:26


> > > - The standard facets (and the locale class itself, in that it is a
> > > functor for comparing basic_strings) are tied to facilities such as
> > > std::basic_string and std::ios_base which are not suitable for
> > > Unicode support.
> >
> > Why not? Once the locale facets are provided, the std iostreams will
> > "just work", that was the whole point of templating them in the first
> > place.
>
> I have already gone over this in other posts, but, in short,
> std::basic_string makes performance guarantees that are at odds with
> Unicode strings.

A basic_string is a sequence of code points, no more, no less: all the
performance guarantees for basic_string can be met as such.

> > However I think we're getting ahead of ourselves here: I think a
> > Unicode library should be handled in stages:
> >
> > 1) define the data types for 8/16/32 bit Unicode characters.
>
> The fact that you believe this is a reasonable first step leads me to
> believe that you have not given much thought to the fact that even if
> you use a 32-bit Unicode encoding, a character can take up more than 32
> bits (and likewise for 16-bit and 8-bit encodings). Unicode characters
> are not fixed-width data in any encoding.

Well, it is the same first step that ICU takes. There is also a proposal
before the C language committee to introduce such data types (they're called
char16_t and char32_t), and C++ is likely to follow suit (see
http://std.dkuug.dk/jtc1/sc22/wg14/www/docs/n1040.pdf).

I'm talking about code points (and sequences thereof), not characters or
glyphs, which as you say can consist of multiple code points.

I would handle "characters" and "glyphs" as iterator adapters sitting on top
of sequences of code points. For code points, basic_string is as good a
container as any (as are vector and deque and anything else you care to
define).
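
As a rough illustration of the "characters on top of code points" idea, the
core of such an adapter could be a function that finds where the character
starting at a given position ends. This is only a sketch: is_combining_mark
and character_end are hypothetical names, the code assumes a compiler that
provides the char32_t type from the proposal mentioned above, and it treats
only U+0300..U+036F as combining marks, which is far short of the real
Unicode segmentation rules.

```cpp
#include <string>

// Hypothetical helper: true only for the Combining Diacritical Marks block
// (U+0300..U+036F). Real Unicode has many more combining code points.
inline bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Given the start of a "character" in a sequence of code points, return the
// position one past its end: the base code point plus any combining marks
// that follow it.
template <typename Iter>
Iter character_end(Iter first, Iter last) {
    if (first == last) return last;
    ++first;  // consume the base code point
    while (first != last && is_combining_mark(*first)) {
        ++first;  // consume marks attached to that base
    }
    return first;
}
```

Calling character_end repeatedly walks a std::u32string (or a vector or
deque of code points) one "character" at a time without changing how the
underlying container stores its data.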

> > 2) define iterator adapters to convert a sequence of one Unicode
> > character type to another.
>
> This is also not as easy as you seem to believe that it is, because even
> within one encoding many strings can have multiple representations.

I'm not talking about normalisation / composition here, just conversion
between encodings. ICU does this already, as do many other libraries.

Iterator adapters for normalisation / composition / compression would also
be useful additions.

Likewise adapters for iterating "characters" and "glyphs".
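
To make the "conversion between encodings" step concrete, here is a minimal
sketch of the decoding logic a UTF-16 to UTF-32 adapter would wrap. The
names decode_utf16 and utf16_to_utf32 are made up for illustration, the code
assumes the char16_t / char32_t types from the proposal above, and replacing
unpaired surrogates with U+FFFD is just one possible error policy.

```cpp
#include <string>

// Decode one code point from a UTF-16 sequence, advancing `it` past the
// code units consumed. Precondition: it != last.
// Assumption of this sketch: unpaired surrogates decode to U+FFFD.
template <typename Iter>
char32_t decode_utf16(Iter& it, Iter last) {
    char32_t unit = static_cast<char16_t>(*it++);
    if (unit >= 0xD800 && unit <= 0xDBFF) {            // high surrogate
        if (it != last) {
            char32_t low = static_cast<char16_t>(*it);
            if (low >= 0xDC00 && low <= 0xDFFF) {      // valid pair
                ++it;
                return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
            }
        }
        return 0xFFFD;                                 // orphaned high surrogate
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
        return 0xFFFD;                                 // orphaned low surrogate
    }
    return unit;                                       // ordinary BMP code point
}

// Convert a whole UTF-16 string to UTF-32, one code point at a time.
inline std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    auto it = in.begin();
    while (it != in.end()) {
        out.push_back(decode_utf16(it, in.end()));
    }
    return out;
}
```

An iterator adapter would call the same decoding step lazily on dereference
instead of building the whole output string up front.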

> > 3) define char_traits specialisations (as necessary) in order to get
> > basic_string working with Unicode character sequences, typedef the
> > appropriate string types:
> >
> > typedef basic_string<utf8_t> utf8_string; // etc
>
> This is not a good idea. If you do this, you will produce a basic_string
> which can violate well-formedness of Unicode strings when you use any
> mutation algorithm other than concatenation, or you will violate
> performance guarantees of basic_string.

Working on sequences of code points always requires care: clearly one could
erase a low surrogate and leave a high surrogate "orphaned" behind, for
example. One would need to make it clear in the documentation that potential
problems like this can occur.
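
The orphaned-surrogate case is easy to reproduce and to detect. The sketch
below uses a hypothetical checker, is_well_formed_utf16, over a
std::u16string (again assuming the char16_t type from the proposal above);
it only checks surrogate pairing and says nothing about normalisation.

```cpp
#include <cstddef>
#include <string>

// Returns true if every surrogate code unit in `s` belongs to a
// high/low pair, i.e. the sequence is well-formed UTF-16.
inline bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {              // high surrogate
            if (i + 1 == s.size()) return false;       // nothing follows it
            char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;                                       // skip the low half
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return false;                              // low surrogate alone
        }
    }
    return true;
}
```

For example, a string holding 'A' followed by the pair 0xD83D 0xDE00 passes
the check, but after s.erase(2, 1) removes the low surrogate the same check
fails, because the high surrogate is left without its partner.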

> > 7) Anything I've forgotten :-)
>
> I think you have forgotten to read and understand the complexity of
> Unicode (or any of the books that discuss the spec less tersely, such as
> Unicode Demystified), because I think that some of the suggestions you
> made here are incompatible with how Unicode actually works. Please
> correct me if I am wrong -- I would love to be wrong :-)

Well sometimes I'm wrong, and sometimes I'm right ;-)

Unicode is such a large and complex issue that it's actually pretty hard to
keep even a small fraction of the issues in one's mind at a time, hence my
suggestion to split the issue up into a series of steps.

> > The main goal would be to define a good clean interface, the
> > implementation could be:
>
> We can't define a good clean interface until we understand the problems.

Obviously.

John.

