From: Vladimir Prus (ghost_at_[hidden])
Date: 2004-10-20 08:00:57


Peter Dimov wrote:

> Vladimir Prus wrote:
>> Second question is if operator==, operator< or 'find' should operate
>> on vector<char_XX> or on abstract characters, using Unicode rules, or
>> there should be two versions. I don't really understand why
>> 'unicode-unaware' semantic is ever needed, so we should have only
>> 'unicode-aware' one.
>
> Look at 21.3/2: "The class template basic_string conforms to the
> requirements of a Sequence, as specified in (23.1.1).
>
> Additionally, because the iterators supported by basic_string are random
> access iterators (24.1.5), basic_string conforms to the requirements
> of a Reversible Container, as specified in (23.1)."
>
> Now look at Table 65, Container requirements, operator==:
>
> "== is an equivalence relation.
>
> a.size()==b.size() && equal(a.begin(), a.end(), b.begin())"

Yes, I know that.
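
To make the consequence of that requirement concrete, here is a minimal sketch, assuming a hypothetical string16 that is simply a vector of UTF-16 code units (C++11 or later, for char16_t). The names string16, storage_equal, precomposed and decomposed are illustrative only and do not come from any proposed library. The Table 65 definition, applied to storage, treats two spellings of the same abstract character as different strings:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical storage type: a plain sequence of UTF-16 code units.
using string16 = std::vector<char16_t>;

// Equality as Table 65 words it: same size, element-wise equal.
inline bool storage_equal(const string16& a, const string16& b)
{
    return a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin());
}

// U+00E9 LATIN SMALL LETTER E WITH ACUTE, stored as one code unit.
inline string16 precomposed() { return string16{0x00E9}; }

// The same abstract character spelled as U+0065 followed by
// U+0301 COMBINING ACUTE ACCENT, stored as two code units.
inline string16 decomposed() { return string16{0x0065, 0x0301}; }

inline void storage_equality_demo()
{
    assert(storage_equal(precomposed(), precomposed()));
    // Sizes differ (1 vs 2), so the Container-style comparison says "not equal"
    // even though a reader sees the same character.
    assert(!storage_equal(precomposed(), decomposed()));
}
```

A 'unicode-aware' operator== would have to normalize or compare abstract characters instead, which this sketch deliberately does not attempt.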

> The question is now, what do begin(), end() and size() return for our
> hypothetical string16?

The first two return some hypothetical iterator. Its operator* will return
either "unicode_character" (base character + all accents) or
"unicode_character_ref", which will refer back to storage and extract
components on demand. The performance of both versions is unclear. Further,
with "unicode_character_ref", the iterator won't be an lvalue iterator (or,
in standard terms, a random access iterator).
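
A rough sketch of the first variant, where operator* returns a "unicode_character" by value, might look like the following. Everything here is an assumption made for illustration: the storage is a vector of char32_t code points rather than UTF-16, only U+0300..U+036F are treated as combining marks, and the names character_iterator and is_combining are made up. Because operator* does not return a real reference, the iterator can only claim the standard input iterator category, and the variable width of a character rules out constant-time random access in any case.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical: one abstract character is a base code point plus the
// combining marks that follow it.
using unicode_character = std::vector<char32_t>;

// Simplifying assumption: only U+0300..U+036F count as combining marks.
inline bool is_combining(char32_t c) { return c >= 0x0300 && c <= 0x036F; }

class character_iterator
{
public:
    using storage_iterator  = std::vector<char32_t>::const_iterator;
    using iterator_category = std::input_iterator_tag;
    using value_type        = unicode_character;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const unicode_character*;
    using reference         = unicode_character; // by value, not an lvalue

    character_iterator(storage_iterator pos, storage_iterator end)
        : pos_(pos), end_(end) {}

    // Builds the character on demand from the underlying storage.
    unicode_character operator*() const { return unicode_character(pos_, next()); }

    character_iterator& operator++() { pos_ = next(); return *this; }
    character_iterator operator++(int) { character_iterator tmp(*this); ++*this; return tmp; }

    bool operator==(const character_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const character_iterator& o) const { return pos_ != o.pos_; }

private:
    // Position just past the current base character and its combining marks.
    storage_iterator next() const
    {
        assert(pos_ != end_);
        storage_iterator p = pos_;
        ++p;
        while (p != end_ && is_combining(*p)) ++p;
        return p;
    }

    storage_iterator pos_, end_;
};

inline void character_iterator_demo()
{
    // 'e' + combining acute accent, then 'x': three code points, two characters.
    const std::vector<char32_t> s{0x0065, 0x0301, 0x0078};
    character_iterator first(s.begin(), s.end());
    character_iterator last(s.end(), s.end());
    assert(std::distance(first, last) == 2);
    assert((*first).size() == 2);
}
```

The "unicode_character_ref" variant would replace the by-value result with a small proxy holding the two storage positions; it avoids the copy but has the same iterator-category limitation.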

> I maintain that the library design is much cleaner if begin(), end() and
> size() are random access iterators over the underlying _storage_, not over
> the codepoint representation or abstract character representation.

But those methods allow one to work directly on storage, and I don't know if
that's ever needed, or more common than working at the character level. And
after all, unicode_string can have a 'storage()' method that gives
vector<char16_t>&, if direct storage manipulation is desired.
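
As a minimal sketch of that idea, assuming UTF-16 storage and C++11: only the name storage() comes from the paragraph above, while code_point_count() is a hypothetical stand-in for the character-level interface, simplified to code points rather than full abstract characters.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical class; a real design would offer far more than this.
class unicode_string
{
public:
    using storage_type = std::vector<char16_t>;

    unicode_string() = default;
    explicit unicode_string(storage_type units) : units_(std::move(units)) {}

    // Escape hatch: direct, mutable access to the UTF-16 code units.
    storage_type& storage() { return units_; }
    const storage_type& storage() const { return units_; }

    // Character-level view, simplified to code points: every code unit
    // except a low (trailing) surrogate starts a new code point.
    // Assumes the storage holds well-formed UTF-16.
    std::size_t code_point_count() const
    {
        std::size_t n = 0;
        for (char16_t u : units_)
            if (u < 0xDC00 || u > 0xDFFF) ++n;
        return n;
    }

private:
    storage_type units_;
};

inline void storage_demo()
{
    // U+1D11E MUSICAL SYMBOL G CLEF (surrogate pair D834 DD1E), then 'A'.
    unicode_string s(unicode_string::storage_type{0xD834, 0xDD1E, 0x0041});
    assert(s.storage().size() == 3);   // three code units
    assert(s.code_point_count() == 2); // two code points
    s.storage().push_back(0x0042);     // direct storage manipulation
    assert(s.code_point_count() == 3);
}
```

The point of the split is that the default interface speaks in characters, and code that really needs raw code units has to ask for them explicitly.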

> Codepoint iterators and abstract character iterators would still be
> provided, but they would be constant bidirectional with char32_t as the
> value_type.
>
> Codepoint and abstract character operations would be provided by
> algorithms, taking an iterator range.
>
> The user should remember and honor the encoding (UTF-16, UCS-2, other) of
> a particular container of char16_t, not the container itself.

So, vector<char16_t> could be either UTF-16 or UCS-2? I think that's a bad
idea.

If a library accepts a Unicode string, then its interface can either:
- use 'unicode_string'
- use 'unicode_string<some_encoding>'
- use 'vector<char16_t>' and have a comment that the string is UTF-16.

I think the first option is best, and the last is too easy to misuse.
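
To illustrate why the comment-only option is easy to misuse, here is a small sketch contrasting it with the second option. The tag types utf16 and ucs2, the template encoded_string and both functions are hypothetical names made up for this example, and the first option would look like the second with the encoding fixed inside the class.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical encoding tags.
struct utf16 {};
struct ucs2 {};

// Second option: the encoding is part of the type.
template <class Encoding>
class encoded_string
{
public:
    explicit encoded_string(std::vector<char16_t> units) : units_(std::move(units)) {}
    const std::vector<char16_t>& storage() const { return units_; }

private:
    std::vector<char16_t> units_;
};

// Third option: the units are expected to be UTF-16, but only this
// comment says so; the compiler accepts any char16_t data.
inline std::size_t unit_count_unchecked(const std::vector<char16_t>& units)
{
    return units.size();
}

// With a typed interface, the requirement is checked at compile time.
inline std::size_t unit_count(const encoded_string<utf16>& s)
{
    return s.storage().size();
}

// A UCS-2 string cannot silently be passed where UTF-16 is required.
static_assert(!std::is_convertible<encoded_string<ucs2>, encoded_string<utf16>>::value,
              "encodings must not mix silently");

inline void interface_demo()
{
    std::vector<char16_t> units{0x0041, 0x0042};
    assert(unit_count_unchecked(units) == 2);             // anything goes
    assert(unit_count(encoded_string<utf16>(units)) == 2);
    // unit_count(encoded_string<ucs2>(units));           // would not compile
}
```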

- Volodya

