From: Vladimir Prus (ghost_at_[hidden])
Date: 2004-10-21 01:17:40


Eric Niebler wrote:

>> A single string class shall be used to store Unicode strings, i.e.
>> logical sequences of Unicode abstract characters.
>>
>> This string shall be stored in one chosen encoding, for example UTF-8.
>> The user does not have direct access to the underlying storage, however,
>> so it might be regarded as an implementation detail.
>>
>> An invariant of the string is that it is always in one chosen normalized
>> form. Iteration over the string gives back a sequence of char32_t
>> abstract characters. Comparisons are defined in terms of these sequences.
>>
>> Is this a fair summary?
>
>
> Such a one-size-fits-all unicode_string is guaranteed to be inefficient
> for some applications. If it is always stored in a decomposed form, an
> XML library probably wouldn't want to use it, because it requires a
> composed form. And making the encoding an implementation detail makes it
> inefficient to use in situations where binary compatibility matters
> (serialization, for example).

This seems right, but there's a catch. Configurable encoding would help if
all components of your application use the same encoding. Say the XML parser
wants composed form, so you use unicode_string<utf16, composed>. Now
another part of your application (a library written by somebody else) uses
a different encoding, and you have to convert the data at the interface.

If there's only one encoding, you need to do conversion only for code which
really, really needs another encoding. If there are several encodings, then
different libraries will use different encodings based on educated guesses
about the data, and you'll be converting everywhere.
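
To make the interface cost concrete, here is a minimal sketch of what a policy-based string might look like. Everything in it is hypothetical: the tag types, the unicode_string template, convert() and the conversion counter are illustration only, not a proposed Boost interface. The storage is a plain std::u32string stand-in, and convert() does no real transcoding or normalization; it only marks where that work would have to happen.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical policy tags (assumptions, not an existing library).
struct utf8 {};
struct utf16 {};
struct composed {};
struct decomposed {};

// Hypothetical policy-based string. Each (Encoding, Form) pair is a
// distinct type, so mixing them requires an explicit conversion.
template <class Encoding, class Form>
class unicode_string {
public:
    explicit unicode_string(std::u32string code_points)
        : code_points_(std::move(code_points)) {}
    const std::u32string& code_points() const { return code_points_; }
private:
    std::u32string code_points_;  // stand-in for real encoded storage
};

// Counts how often data crosses an encoding boundary.
static int conversion_count = 0;

// Placeholder conversion: a real one would transcode and renormalize.
template <class ToEnc, class ToForm, class FromEnc, class FromForm>
unicode_string<ToEnc, ToForm>
convert(const unicode_string<FromEnc, FromForm>& s) {
    ++conversion_count;
    return unicode_string<ToEnc, ToForm>(s.code_points());
}

// The XML parser's choice versus some other library's choice.
typedef unicode_string<utf16, composed> xml_string;
typedef unicode_string<utf8, decomposed> other_string;

// A function from "a library written by somebody else".
std::size_t third_party_length(const other_string& s) {
    return s.code_points().size();
}

// Every call across the boundary pays for one conversion.
std::size_t call_across_boundary(const xml_string& s) {
    return third_party_length(convert<utf8, decomposed>(s));
}
```

Passing an xml_string straight to third_party_length() would not compile, so each extra combination of policies in use adds another place where convert() has to be written and paid for.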

> Also, it is impossible to store an abstract unicode character in
> char32_t because there may be N zero-width combining characters
> associated with it.
>
> Perhaps having a one-size-fits-all unicode_string might be a nice
> default, as long as users who care about encoding and canonical form
> have other types (template + policies?) with knobs they can twiddle.
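
The combining-character point above can be checked with plain C++11 string literals. This is only a small illustration; the function names are made up for the example. An "e" followed by U+0301 (COMBINING ACUTE ACCENT) displays as one character but occupies two char32_t values, and it does not compare equal to the precomposed U+00E9 unless something normalizes first.

```cpp
#include <cassert>
#include <string>

// 'e' followed by U+0301 COMBINING ACUTE ACCENT:
// one user-perceived character, two code points.
std::u32string decomposed_e_acute() { return U"e\u0301"; }

// U+00E9 LATIN SMALL LETTER E WITH ACUTE: the precomposed equivalent,
// a single code point.
std::u32string composed_e_acute() { return U"\u00E9"; }

// Plain code-point comparison, with no normalization applied.
bool same_code_points(const std::u32string& a, const std::u32string& b) {
    return a == b;
}
```

Without normalization, same_code_points() treats the two spellings as different strings, which is why the choice of normalized form matters for comparison.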

Maybe. I just wish there were some efficient mechanism to prevent users who
have not read the entire Unicode standard 10 times, and so do not know what
they're doing, from touching the knobs ;-)

- Volodya