From: Jeremy Maitin-Shepard (jbms_at_[hidden])
Date: 2004-04-14 14:43:04


Anthony Williams <anthony_w.geo_at_[hidden]> writes:

> Miro Jurisic <macdev_at_[hidden]> writes:

>> [snip]

>> I have already gone over this in other posts, but, in short,
>> std::basic_string makes performance guarantees that are at odds with Unicode
>> strings.

> Only if you use an encoding other than UTF-32/UCS-4. The character type has
> to be a (POD) UDT rather than a typedef, so that one may specialize
> std::char_traits for it. Of course, if this gets standardized, then it can be
> a built-in, since the standard can specialize its own templates.
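
As a rough illustration of the POD UDT approach quoted above, the following is a minimal sketch only. The names utf32_t, utf32_string and sample are hypothetical and do not come from any existing library, and the traits shown are the bare minimum basic_string needs, not a complete Unicode-aware design.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>   // std::mbstate_t
#include <ios>      // std::streamoff, std::streampos
#include <string>

// Hypothetical POD code-unit type: one element holds one UTF-32 code point.
struct utf32_t {
    std::uint32_t value;
};

namespace std {
// Specializing char_traits is allowed here because utf32_t is a user-defined type.
template <>
struct char_traits<utf32_t> {
    typedef utf32_t        char_type;
    typedef std::uint32_t  int_type;
    typedef std::streamoff off_type;
    typedef std::streampos pos_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& a, const char_type& b) { a = b; }
    static bool eq(char_type a, char_type b) { return a.value == b.value; }
    static bool lt(char_type a, char_type b) { return a.value < b.value; }

    // Plain code point order; this is not linguistic collation.
    static int compare(const char_type* a, const char_type* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    static std::size_t length(const char_type* s) {
        std::size_t n = 0;
        while (s[n].value != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n,
                                 const char_type& c) {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return nullptr;
    }
    static char_type* move(char_type* d, const char_type* s, std::size_t n) {
        if (n != 0) std::memmove(d, s, n * sizeof(char_type));
        return d;
    }
    static char_type* copy(char_type* d, const char_type* s, std::size_t n) {
        if (n != 0) std::memcpy(d, s, n * sizeof(char_type));
        return d;
    }
    static char_type* assign(char_type* s, std::size_t n, char_type c) {
        for (std::size_t i = 0; i < n; ++i) s[i] = c;
        return s;
    }
    // 0xFFFFFFFF is not a valid code point, so it can serve as the EOF marker.
    static int_type eof() { return 0xFFFFFFFFu; }
    static int_type not_eof(int_type c) { return c == eof() ? 0 : c; }
    static char_type to_char_type(int_type c) { char_type r = { c }; return r; }
    static int_type to_int_type(char_type c) { return c.value; }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
};
}  // namespace std

typedef std::basic_string<utf32_t> utf32_string;

// Usage sketch: three code points, one element each, including one outside the BMP.
inline utf32_string sample() {
    const utf32_t cps[] = { {0x48}, {0x1F600}, {0x20AC} };
    utf32_string s(cps, 3);
    assert(s.size() == 3);
    return s;
}
```

With this arrangement every element is a whole code point, so indexing and length stay constant-time as basic_string promises. It does nothing about combining sequences or the memory cost of four bytes per element.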

Performance guarantees aside, standardizing around UTF-32 is not, IMO,
practical.

> [snip]

>>> 3) define char_traits specialisations (as necessary) in order to get
>>> basic_string working with Unicode character sequences, typedef the
>>> appropriate string types:
>>>
>>> typedef basic_string<utf8_t> utf8_string; // etc
>>
>> This is not a good idea. If you do this, you will produce a basic_string
>> which can violate well-formedness of Unicode strings when you use any
>> mutation algorithm other than concatenation, or you will violate performance
>> guarantees of basic_string.

> Yes. basic_string<CharType> relies on each CharType being a valid entity in
> its own right --- for Unicode this means it must be a single Unicode code
> point, so using basic_string for UTF-8 is out.
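
To make the well-formedness problem concrete, here is a small sketch using plain std::string as a container of UTF-8 code units. The helper names looks_like_utf8 and demo_truncation are hypothetical, and the check is deliberately simplified: it only verifies that each lead byte is followed by the right number of continuation bytes, and does not reject overlong forms or encoded surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified structural check, not a full UTF-8 validator.
inline bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                extra = 0;
        else if ((b & 0xE0) == 0xC0) extra = 1;
        else if ((b & 0xF0) == 0xE0) extra = 2;
        else if ((b & 0xF8) == 0xF0) extra = 3;
        else return false;           // stray continuation byte or invalid lead byte

        if (extra > s.size() - i - 1) return false;  // sequence cut off at the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;    // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}

// Usage sketch: a code-unit-level substr() cuts the euro sign in half.
inline void demo_truncation() {
    const std::string euro = "\xE2\x82\xAC";        // U+20AC as three code units
    assert(looks_like_utf8(euro));
    assert(!looks_like_utf8(euro.substr(0, 2)));    // lead byte lost part of its tail
}
```

The same thing happens with erase, insert, or replace at an arbitrary index: basic_string has no notion of code point boundaries, so nothing stops a mutation from splitting a multi-unit sequence.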

basic_string can still be used as a low-level storage facility, although
it was certainly not designed for that role, and treating it that way
gives up many of the "compatibility" advantages anyway. If you are
advocating UTF-32 as the internal representation, however, I would say
that performance measurements generally show UTF-16 to be significantly
faster for processing, so the advantage of fitting nicely into the
existing interfaces does not justify the choice.
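
For reference, the storage trade-off between the two encodings can be sketched as follows. This assumes a C++11 compiler (for char16_t and char32_t), and the function names append_utf16 and to_utf16 are hypothetical. It shows only the encoding rule and the resulting sizes; it does not measure processing speed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends one code point to a UTF-16 string.
// Assumes cp is a valid Unicode scalar value (not a surrogate, <= 0x10FFFF).
inline void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
}

// Converts a sequence of code points (UTF-32) to UTF-16.
inline std::u16string to_utf16(const std::u32string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) append_utf16(out, in[i]);
    return out;
}
```

Text that stays within the BMP takes half the space in UTF-16 that it takes in UTF-32, at the price of surrogate pairs for everything else, which means a UTF-16 code unit is no more a "valid entity in its own right" than a UTF-8 one.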

> You are right that Unicode does not play fair with most standard locale
> facilities, especially case conversions, which can be 1-1, 1-many, 1-0,
> context-sensitive (which could be seen as many-many), or locale-specific.
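
A short sketch of why per-character interfaces such as std::toupper cannot express these conversions. The function to_upper_sketch is hypothetical and covers only ASCII plus two well-known one-to-many mappings (U+00DF and U+FB01); a real implementation would be driven by the Unicode case mapping data and would also have to handle context-sensitive and locale-specific rules. C++11 is assumed.

```cpp
#include <cassert>
#include <string>

// Deliberately tiny, hypothetical mapping: ASCII plus two one-to-many cases.
inline std::u32string to_upper_sketch(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        if (c == 0x00DF) {
            out += U"SS";                 // LATIN SMALL LETTER SHARP S -> "SS"
        } else if (c == 0xFB01) {
            out += U"FI";                 // LATIN SMALL LIGATURE FI -> "FI"
        } else if (c >= U'a' && c <= U'z') {
            out.push_back(static_cast<char32_t>(c - (U'a' - U'A')));
        } else {
            out.push_back(c);             // everything else is left untouched
        }
    }
    return out;
}
```

The result can be longer than the input, so the conversion has to operate on whole strings and return a new string. A function with the shape "one character in, one character out" has nowhere to put the second S.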

> Collation is one area where the standard library facilities should be OK,
> since the standard library collation support deals with whole strings. When
> you install the collation facet in your locale, you choose the Unicode
> collation options that are relevant to you.
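
The whole-string interface referred to above looks roughly like this. The wrapper names collate_compare and sort_with_locale are hypothetical; std::collate, std::use_facet and the std::locale function call operator are the standard facilities. The sketch uses char strings, and which Unicode collation rules (if any) a given locale applies is entirely up to the installed facet.

```cpp
#include <algorithm>
#include <cassert>
#include <locale>
#include <string>
#include <vector>

// Compares two whole strings using the collate facet of the given locale.
// Returns a negative value, zero, or a positive value.
inline int collate_compare(const std::locale& loc,
                           const std::string& a, const std::string& b) {
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}

// std::locale is itself usable as a comparison object: its operator()
// forwards to the collate facet, so it can be handed straight to std::sort.
inline void sort_with_locale(std::vector<std::string>& v, const std::locale& loc) {
    std::sort(v.begin(), v.end(), loc);
}
```

Because the facet sees both strings in full, it is free to apply multi-level or context-dependent comparison internally, which is why collation fits the existing design better than case conversion does.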

Perhaps, except for the other issues which I have described in other
messages.

-- 
Jeremy Maitin-Shepard
