
From: Anthony Williams (anthony_w.geo_at_[hidden])
Date: 2004-04-14 07:06:36


Miro Jurisic <macdev_at_[hidden]> writes:

> In article <00c101c42167$8d0a7f40$1b440352_at_fuji>,
> "John Maddock" <john_at_[hidden]> wrote:
>
>> > - The standard facets (and the locale class itself, in that it is a
>> > functor for comparing basic_strings) are tied to facilities such as
>> > std::basic_string and std::ios_base which are not suitable for
>> > Unicode support.
>>
>> Why not? Once the locale facets are provided, the std iostreams will "just
>> work", that was the whole point of templating them in the first place.
>
> I have already gone over this in other posts, but, in short,
> std::basic_string makes performance guarantees that are at odds with Unicode
> strings.

Only if you use an encoding other than UTF-32/UCS-4. The UTF-32 character type
has to be a (POD) UDT rather than a typedef, so that one may specialize
std::char_traits for it. Of course, if this gets standardized, then it can be a
built-in type, since the standard can specialize its own templates.
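
A rough sketch of what I have in mind (the name utf32_t, its layout, and the
exact traits members shown are purely illustrative, not a proposal):

#include <cstddef>
#include <cstring>
#include <cwchar>
#include <iosfwd>
#include <string>

// Hypothetical POD code-unit type for UTF-32; one object holds one
// Unicode code point (0..0x10FFFF).
struct utf32_t
{
    unsigned long value;
};

namespace std
{
    // Legal because utf32_t is a UDT: specialize char_traits so that
    // basic_string (and later the iostreams) know how to handle it.
    template<> struct char_traits<utf32_t>
    {
        typedef utf32_t       char_type;
        typedef unsigned long int_type;
        typedef streampos     pos_type;
        typedef streamoff     off_type;
        typedef mbstate_t     state_type;

        static void assign(char_type& c1, const char_type& c2) { c1 = c2; }
        static bool eq(const char_type& a, const char_type& b) { return a.value == b.value; }
        static bool lt(const char_type& a, const char_type& b) { return a.value < b.value; }

        static int compare(const char_type* s1, const char_type* s2, size_t n)
        {
            for (size_t i = 0; i != n; ++i)
            {
                if (lt(s1[i], s2[i])) return -1;
                if (lt(s2[i], s1[i])) return 1;
            }
            return 0;
        }
        static size_t length(const char_type* s)
        {
            size_t n = 0;
            while (s[n].value != 0) ++n;
            return n;
        }
        static const char_type* find(const char_type* s, size_t n, const char_type& a)
        {
            for (size_t i = 0; i != n; ++i)
                if (eq(s[i], a)) return s + i;
            return 0;
        }
        static char_type* move(char_type* s1, const char_type* s2, size_t n)
        {
            return static_cast<char_type*>(memmove(s1, s2, n * sizeof(char_type)));
        }
        static char_type* copy(char_type* s1, const char_type* s2, size_t n)
        {
            return static_cast<char_type*>(memcpy(s1, s2, n * sizeof(char_type)));
        }
        static char_type* assign(char_type* s, size_t n, char_type a)
        {
            for (size_t i = 0; i != n; ++i) s[i] = a;
            return s;
        }
        static int_type to_int_type(const char_type& c) { return c.value; }
        static char_type to_char_type(const int_type& i) { char_type c; c.value = i; return c; }
        static bool eq_int_type(const int_type& a, const int_type& b) { return a == b; }
        static int_type eof() { return static_cast<int_type>(-1); }
        static int_type not_eof(const int_type& i) { return i == eof() ? 0 : i; }
    };
}

// Given the specialization, the string typedef follows:
typedef std::basic_string<utf32_t> utf32_string;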

>> However I think we're getting ahead of ourselves here: I think a Unicode
>> library should be handled in stages:
>>
>> 1) define the data types for 8/16/32 bit Unicode characters.
>
> The fact that you believe this is a reasonable first step leads me to
> believe that you have not given much thought to the fact that even if you
> use a 32-bit Unicode encoding, a character can take up more than 32 bits
> (and likewise for 16-bit and 8-bit encodings). Unicode characters are not
> fixed-width data in any encoding.

Yes, but a code point is 32 bits, and code points can come in any sequence. A
given sequence of code points may or may not have a valid semantic meaning as a
"character", but that is like debating whether or not "fjkp" is a valid word
--- beyond the scope of basic string handling facilities.

>> 2) define iterator adapters to convert a sequence of one Unicode character
>> type to another.
>
> This is also not as easy as you seem to believe, because even
> within one encoding many strings can have multiple representations.

That is why there are various canonical forms defined. We should provide a
means of converting to the canonical forms.

However, this is independent of Unicode encoding --- the same sequence of code
points can be represented in each Unicode encoding in precisely one way.
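
To illustrate the encoding side (as distinct from normalization), here is a
rough sketch of encoding a single code point as UTF-8; it omits checks for
surrogates and out-of-range values, and the function name is just for
illustration:

#include <string>

// For any valid code point there is exactly one shortest-form
// (well-formed) UTF-8 sequence, produced below. Surrogate and range
// checks are omitted for brevity.
std::string encode_utf8(unsigned long cp)
{
    std::string result;
    if (cp < 0x80)
    {
        result += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        result += static_cast<char>(0xC0 | (cp >> 6));
        result += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        result += static_cast<char>(0xE0 | (cp >> 12));
        result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        result += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        result += static_cast<char>(0xF0 | (cp >> 18));
        result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        result += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return result;
}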

>> 3) define char_traits specialisations (as necessary) in order to get
>> basic_string working with Unicode character sequences, typedef the
>> appropriate string types:
>>
>> typedef basic_string<utf8_t> utf8_string; // etc
>
> This is not a good idea. If you do this, you will produce a basic_string
> which can violate well-formedness of Unicode strings when you use any
> mutation algorithm other than concatenation, or you will violate performance
> guarantees of basic_string.

Yes. basic_string<CharType> relies on each CharType being a valid entity in
its own right --- for Unicode this means it must be a single Unicode code
point, so using basic_string for UTF-8 is out.
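
A trivial example of the problem: code-unit-level mutation through the
basic_string interface takes no notice of the UTF-8 sequence structure.

#include <cassert>
#include <string>

int main()
{
    std::string s("caf\xC3\xA9"); // "café" as UTF-8: 5 code units
    s.erase(4, 1);                // removes only the trailing byte 0xA9
    // s now ends in a lone lead byte 0xC3, so it is no longer
    // well-formed UTF-8, although every basic_string invariant holds.
    assert(s.size() == 4);
    return 0;
}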

You are right that Unicode does not play fair with most standard locale
facilities, especially case conversions: the mappings can be 1-to-1, 1-to-many,
1-to-0, context-sensitive (which could be seen as many-to-many), and
locale-specific.

Collation is one area where the standard library facilities should be OK,
since the standard library collation support deals with whole strings. When
you install the collation facet in your locale, you choose the Unicode
collation options that are relevant to you.
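
For example (just a sketch; the named locale "en_US.UTF-8" is
platform-dependent, and in practice you would install whichever collate facet
implements the Unicode collation behaviour you want):

#include <algorithm>
#include <locale>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> words;
    words.push_back("zebra");
    words.push_back("apple");
    words.push_back("mango");

    // A std::locale object is itself a comparison functor for
    // basic_strings; the installed collate<> facet determines the
    // ordering of whole strings.
    std::locale loc("en_US.UTF-8"); // name is platform-dependent
    std::sort(words.begin(), words.end(), loc);

    return 0;
}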

Anthony

-- 
Anthony Williams
Senior Software Engineer, Beran Instruments Ltd.
