From: Miro Jurisic (macdev_at_[hidden])
Date: 2004-04-13 14:27:52


In article <00c101c42167$8d0a7f40$1b440352_at_fuji>,
 "John Maddock" <john_at_[hidden]> wrote:

> > - The standard facets (and the locale class itself, in that it is a
> > functor for comparing basic_strings) are tied to facilities such as
> > std::basic_string and std::ios_base which are not suitable for
> > Unicode support.
>
> Why not? Once the locale facets are provided, the std iostreams will "just
> work", that was the whole point of templating them in the first place.

I have already gone over this in other posts, but, in short, std::basic_string
makes performance guarantees that are at odds with Unicode strings.

> However I think we're getting ahead of ourselves here: I think a Unicode
> library should be handled in stages:
>
> 1) define the data types for 8/16/32 bit Unicode characters.

That you consider this a reasonable first step leads me to believe that you
have not given much thought to the fact that even if you use a 32-bit Unicode
encoding, a character can take up more than 32 bits (and likewise for 16-bit
and 8-bit encodings). Unicode characters are not fixed-width data in any
encoding.
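
To make the point concrete, here is a minimal sketch (my own illustration, not code from any proposed library) that measures how many bits one user-visible character needs when it is written as a base letter plus a combining mark. The helper names (utf8_units, utf16_units, bits_in_utf8, and so on) are hypothetical, and char32_t is assumed to hold one code point.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of 8-bit code units needed to encode one code point in UTF-8.
std::size_t utf8_units(char32_t cp) {
    assert(cp <= 0x10FFFF);  // outside this range is not a Unicode code point
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Number of 16-bit code units needed to encode one code point in UTF-16.
std::size_t utf16_units(char32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp >= 0x10000 ? 2 : 1;  // supplementary planes need a surrogate pair
}

// Total bits occupied by a sequence of code points in each encoding form.
std::size_t bits_in_utf8(const std::u32string& s) {
    std::size_t units = 0;
    for (char32_t cp : s) units += utf8_units(cp);
    return units * 8;
}

std::size_t bits_in_utf16(const std::u32string& s) {
    std::size_t units = 0;
    for (char32_t cp : s) units += utf16_units(cp);
    return units * 16;
}

std::size_t bits_in_utf32(const std::u32string& s) {
    return s.size() * 32;  // one 32-bit code unit per code point
}

// One character as a reader sees it: "e" followed by
// COMBINING ACUTE ACCENT (U+0301), i.e. a decomposed e-acute.
std::u32string decomposed_e_acute() {
    return std::u32string{U'e', U'\u0301'};
}
```

For decomposed_e_acute(), bits_in_utf32 gives 64, bits_in_utf16 gives 32 and bits_in_utf8 gives 24, so in every encoding form this single character is wider than one code unit.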

> 2) define iterator adapters to convert a sequence of one Unicode character
> type to another.

This is also not as easy as you seem to believe, because even within one
encoding many strings have multiple representations.
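
As a rough illustration of multiple representations within a single encoding, the sketch below builds the same text (e-acute) two ways in UTF-32. The function toy_compose is a hypothetical stand-in that handles exactly one pair; it is not Unicode normalization and is only there to show that something beyond code-unit transcoding is needed before the two spellings compare equal.

```cpp
#include <cstddef>
#include <string>

// Precomposed spelling: LATIN SMALL LETTER E WITH ACUTE (U+00E9).
std::u32string precomposed_e_acute() {
    return std::u32string{U'\u00E9'};
}

// Decomposed spelling: "e" followed by COMBINING ACUTE ACCENT (U+0301).
std::u32string decomposed_spelling() {
    return std::u32string{U'e', U'\u0301'};
}

// Toy composition step (hypothetical, for illustration only): it knows
// about the single pair e + U+0301 and nothing else. Real normalization
// is driven by the Unicode character database and is far more involved.
std::u32string toy_compose(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == U'e' && i + 1 < in.size() && in[i + 1] == U'\u0301') {
            out.push_back(U'\u00E9');
            ++i;  // skip the combining mark we just consumed
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

Here precomposed_e_acute() == decomposed_spelling() is false when compared code unit by code unit, although both spell the same text; only after toy_compose(decomposed_spelling()) do the two compare equal. An iterator adapter that merely converts code units from one encoding form to another preserves that difference.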

> 3) define char_traits specialisations (as necessary) in order to get
> basic_string working with Unicode character sequences, typedef the
> appropriate string types:
>
> typedef basic_string<utf8_t> utf8_string; // etc

This is not a good idea. If you do this, you will produce a basic_string which
can violate well-formedness of Unicode strings when you use any mutation
algorithm other than concatenation, or you will violate performance guarantees
of basic_string.
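
A small sketch of the first failure mode, assuming a plain std::string is used to hold UTF-8 code units. The checker looks_structurally_like_utf8 is a hypothetical, simplified test (it checks only lead and continuation byte structure, and does not reject overlong forms or surrogates); the other names are likewise mine.

```cpp
#include <cstddef>
#include <string>

// Simplified structural check: every lead byte must be followed by the
// right number of continuation bytes (10xxxxxx). Not a full validator.
bool looks_structurally_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t trailing = 0;
        if (lead < 0x80) trailing = 0;
        else if ((lead & 0xE0) == 0xC0) trailing = 1;
        else if ((lead & 0xF0) == 0xE0) trailing = 2;
        else if ((lead & 0xF8) == 0xF0) trailing = 3;
        else return false;  // stray continuation byte or invalid lead byte

        if (s.size() - i - 1 < trailing) return false;  // sequence cut short
        for (std::size_t k = 1; k <= trailing; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
        }
        i += trailing + 1;
    }
    return true;
}

// U+00E9 (e-acute) encoded in UTF-8 is the two bytes C3 A9.
std::string e_acute_utf8() {
    std::string s;
    s.push_back(static_cast<char>(0xC3));
    s.push_back(static_cast<char>(0xA9));
    return s;
}

// An ordinary basic_string mutation: erase one element. The string
// class has no idea that the element was half of a character.
std::string after_erasing_one_element() {
    std::string s = e_acute_utf8();
    s.erase(1, 1);  // removes the continuation byte, leaves a lone lead byte
    return s;
}
```

The string from e_acute_utf8() passes the check, while the one from after_erasing_one_element() does not: a single erase of one element has produced an ill-formed sequence. Making erase refuse to do that would mean inspecting neighbouring code units on every mutation, which is where the conflict with the performance guarantees comes in.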

> 7) Anything I've forgotten :-)

I think you have forgotten to read and understand the Unicode spec (or any of
the books that discuss it less tersely, such as Unicode Demystified), because I
think that some of the suggestions you made here are incompatible with how
Unicode actually works. Please correct me if I am wrong -- I would love to be
wrong :-)

> The main goal would be to define a good clean interface, the implementation
> could be:

We can't define a good clean interface until we understand the problems.

meeroh

-- 
If this message helped you, consider buying an item
from my wish list: <http://web.meeroh.org/wishlist>
