From: Vladimir Prus (ghost_at_[hidden])
Date: 2004-10-21 01:09:19


Peter Dimov wrote:

>> If a library accepts unicode string, then its interface can either:
>> - use 'unicode_string'
>> - use 'unicode_string<some_encoding>'
>> - use 'vector<char16_t>' and have a comment that the string is UTF-16.
>>
>> I think the first option is best, and the last is too easy to misuse.
>
> Yes.
>
> So let's see if I understand your position correctly.
>
> A single string class shall be used to store Unicode strings, i.e. logical
> sequences of Unicode abstract characters.
>
> This string shall be stored in one chosen encoding, for example UTF-8. The
> user does not have direct access to the underlying storage, however, so it
> might be regarded as an implementation detail.
>
> An invariant of the string is that it is always in one chosen normalized
> form. Iteration over the string gives back a sequence of char32_t abstract
> characters. Comparisons are defined in terms of these sequences.
>
> Is this a fair summary?

Yes, with this addition:

- user can obtain the raw data in any format he likes (local8bit, utf8,
utf16)
- user can construct the string from any format he likes (from the same
list)
- Ideally, there should be an "encoder" add-on, which can handle specific
named encodings ("koi8-r"...)
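
For illustration only, here is a minimal sketch of what such an interface might look like. Every name in it (unicode_string, from_utf8, from_utf16, to_utf8, to_utf16) is hypothetical and not an existing Boost API. The sketch assumes UTF-32 as the hidden internal storage purely to keep the code short; the design above would allow UTF-8 just as well. Normalization, the local8bit conversions, and the named-encoding "encoder" add-on are left out, so comparison here is plain code-point comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical sketch of the single Unicode string class discussed above.
class unicode_string {
public:
    typedef std::u32string::const_iterator const_iterator;

    // Construct from raw UTF-8 bytes (minimal validation only:
    // overlong forms and encoded surrogates are not rejected).
    static unicode_string from_utf8(const std::string& bytes) {
        unicode_string s;
        for (std::size_t i = 0; i < bytes.size();) {
            unsigned char b = static_cast<unsigned char>(bytes[i]);
            int extra = b < 0x80 ? 0 : (b >> 5) == 0x6 ? 1
                      : (b >> 4) == 0xE ? 2 : (b >> 3) == 0x1E ? 3 : -1;
            if (extra < 0 || i + extra >= bytes.size())
                throw std::invalid_argument("malformed UTF-8");
            char32_t c = extra == 0 ? b : b & (0x3F >> extra);
            for (int k = 1; k <= extra; ++k) {
                unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
                if ((cont & 0xC0) != 0x80)
                    throw std::invalid_argument("malformed UTF-8");
                c = (c << 6) | (cont & 0x3F);
            }
            s.points_ += c;  // a real class would normalize here
            i += extra + 1;
        }
        return s;
    }

    // Construct from UTF-16 code units, combining surrogate pairs.
    static unicode_string from_utf16(const std::u16string& units) {
        unicode_string s;
        for (std::size_t i = 0; i < units.size(); ++i) {
            char32_t c = units[i];
            if (c >= 0xD800 && c <= 0xDBFF) {
                if (i + 1 >= units.size())
                    throw std::invalid_argument("malformed UTF-16");
                char32_t lo = units[i + 1];
                if (lo < 0xDC00 || lo > 0xDFFF)
                    throw std::invalid_argument("malformed UTF-16");
                c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            } else if (c >= 0xDC00 && c <= 0xDFFF) {
                throw std::invalid_argument("malformed UTF-16");
            }
            s.points_ += c;
        }
        return s;
    }

    // Obtain the raw data as UTF-8.
    std::string to_utf8() const {
        std::string out;
        for (char32_t c : points_) {
            if (c < 0x80) {
                out += static_cast<char>(c);
            } else if (c < 0x800) {
                out += static_cast<char>(0xC0 | (c >> 6));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else if (c < 0x10000) {
                out += static_cast<char>(0xE0 | (c >> 12));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (c >> 18));
                out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
        return out;
    }

    // Obtain the raw data as UTF-16.
    std::u16string to_utf16() const {
        std::u16string out;
        for (char32_t c : points_) {
            if (c < 0x10000) {
                out += static_cast<char16_t>(c);
            } else {
                c -= 0x10000;
                out += static_cast<char16_t>(0xD800 + (c >> 10));
                out += static_cast<char16_t>(0xDC00 + (c & 0x3FF));
            }
        }
        return out;
    }

    // Iteration yields char32_t values, never encoded code units.
    const_iterator begin() const { return points_.begin(); }
    const_iterator end() const { return points_.end(); }

    // Comparison is defined on the character sequence, not the encoding.
    friend bool operator==(const unicode_string& a, const unicode_string& b) {
        return a.points_ == b.points_;
    }

private:
    std::u32string points_;  // hidden storage; an implementation detail
};
```

With this sketch, a string built with from_utf8 and the same text built with from_utf16 compare equal, and either can be written back out in whichever format the caller asks for.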

- Volodya

