From: Eric Niebler (eric_at_[hidden])
Date: 2004-10-20 12:23:21

Peter Dimov wrote:
> Vladimir Prus wrote:
>> If a library accepts unicode string, then its interface can either:
>> - use 'unicode_string'
>> - use 'unicode_string<some_encoding>'
>> - use 'vector<char16_t>' and have a comment that the string is UTF-16.
>> I think the first option is best, and the last is too easy to misuse.
> Yes.
> So let's see if I understand your position correctly.
> A single string class shall be used to store Unicode strings, i.e.
> logical sequences of Unicode abstract characters.
> This string shall be stored in one chosen encoding, for example UTF-8.
> The user does not have direct access to the underlying storage, however,
> so it might be regarded as an implementation detail.
> An invariant of the string is that it is always in one chosen normalized
> form. Iteration over the string gives back a sequence of char32_t
> abstract characters. Comparisons are defined in terms of these sequences.
> Is this a fair summary?

Such a one-size-fits-all unicode_string is guaranteed to be inefficient
for some applications. If it is always stored in a decomposed form, an
XML library probably wouldn't want to use it, because XML requires a
composed form. And making the encoding an implementation detail makes
the string inefficient to use in situations where binary compatibility
matters (serialization, for example), because the data must be
transcoded whenever the hidden encoding differs from the external one.

Also, it is impossible in general to store an abstract Unicode character
in a single char32_t, because there may be N zero-width combining
characters associated with it.
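
A minimal sketch of the combining-character problem, assuming a C++11
compiler that provides char32_t and std::u32string. The helper names
(is_combining_mark, count_non_combining, check_combining_example) are
hypothetical, and the mark test covers only the Combining Diacritical
Marks block, not every combining character in Unicode:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// 'e' followed by U+0301 COMBINING ACUTE ACCENT: one character as the
// user perceives it, but two code points.
const std::u32string decomposed = U"e\u0301";

// U+00E9 LATIN SMALL LETTER E WITH ACUTE: the precomposed equivalent.
const std::u32string composed = U"\u00E9";

// Hypothetical helper. Simplification: checks only the Combining
// Diacritical Marks block (U+0300..U+036F).
inline bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Counts the code points that are not combining marks (per the
// simplified test above).
inline std::size_t count_non_combining(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t c : s) {
        if (!is_combining_mark(c)) {
            ++n;
        }
    }
    return n;
}

// Hypothetical self-check that spells out what the sketch shows.
inline void check_combining_example() {
    assert(decomposed.size() == 2);                // two char32_t values
    assert(composed.size() == 1);                  // one char32_t value
    assert(count_non_combining(decomposed) == 1);  // a single base character
    assert(decomposed != composed);                // unequal without normalization
}
```

Iterating over decomposed yields two char32_t values for what the user
sees as one character, so an iterator that returns char32_t is handing
back code points rather than whole characters.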

A one-size-fits-all unicode_string might be a nice default, as long as
users who care about encoding and canonical form have other types
(template + policies?) with knobs they can twiddle.
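
One possible shape for "template + policies", as a rough sketch only.
Every name here (utf8_encoding, utf16_encoding, nfc_form, nfd_form,
basic_unicode_string, size_in_code_units) is hypothetical and not an
existing Boost interface. The sketch assumes C++11 and does no
validation or normalization; it only shows where the knobs would go:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical encoding policies: each names the code unit used for storage.
struct utf8_encoding  { typedef char     code_unit; };
struct utf16_encoding { typedef char16_t code_unit; };

// Hypothetical normalization-form tags.
struct nfc_form {};  // composed
struct nfd_form {};  // decomposed

template <class Encoding, class Form>
class basic_unicode_string {
public:
    typedef Encoding encoding_type;
    typedef Form normalization_form;
    typedef typename Encoding::code_unit code_unit;

    // The caller promises that `units` is already valid in Encoding and
    // normalized to Form. This sketch does not check either property.
    explicit basic_unicode_string(std::basic_string<code_unit> units)
        : units_(std::move(units)) {}

    // Direct access to the stored code units, for users who care about
    // binary layout (serialization, for example).
    const code_unit* data() const { return units_.data(); }
    std::size_t size_in_code_units() const { return units_.size(); }

private:
    std::basic_string<code_unit> units_;
};

// The "nice default" for users who do not care about the knobs.
typedef basic_unicode_string<utf8_encoding, nfc_form> unicode_string;

// Hypothetical self-check showing the two instantiations side by side.
inline void check_policy_example() {
    unicode_string s("caf\xC3\xA9");  // UTF-8 bytes for "cafe" with e-acute
    assert(s.size_in_code_units() == 5);

    basic_unicode_string<utf16_encoding, nfd_form> t(u"cafe\u0301");
    assert(t.size_in_code_units() == 5);
}
```

With this layout the encoding and the normalization form are part of
the type, so a library that needs a specific combination can say so in
its interface, while most code just uses the unicode_string typedef.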

Eric Niebler
Boost Consulting
