From: Peter Dimov (pdimov_at_[hidden])
Date: 2004-10-20 09:03:42


Vladimir Prus wrote:
> Peter Dimov wrote:
>
>> The question is now, what do begin(), end() and size() return for our
>> hypothetical string16?
>
> First two return some hypothetical iterator. Its operator* will return
> either "unicode_character" (base character + all accents) or
> "unicode_character_ref" which will refer back to storage and extract
> components on demand. The performance of both versions is unclear.
> Further, with "unicode_character_ref", the iterator won't be an lvalue
> iterator (or random access iterator in standard terms).
>
>> I maintain that the library design is much cleaner if begin() and
>> end() are random access iterators over the underlying _storage_, and
>> size() counts storage elements, not codepoints or abstract
>> characters.
>
> But those methods allow working directly on storage, and I don't know
> if that's ever needed, or more common than working at the character
> level.

Well, only storage elements can be directly manipulated. So if any direct
manipulation is needed, it needs to be direct storage manipulation. ;-)

> And after all, unicode_string can have 'storage()' method that gives
> vector<char_16>&, if direct storage manipulation is desired.

I'm not sure that this is a good idea. It violates encapsulation and can
break the invariant of unicode_string, which, if I understand correctly, is
that it contains a sequence of abstract Unicode characters in a particular
pre-determined normalized form, encoded using a particular, pre-determined
encoding.
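
To make the encapsulation concern concrete, here is a minimal sketch. The
class name leaky_unicode_string and the helper is_well_formed_utf16 are
hypothetical, and the invariant is simplified to "well-formed UTF-16" with
no normalization at all. It only shows how a mutable storage() accessor
lets a caller break whatever invariant the class promises.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: true if every high surrogate is followed by a low
// surrogate and no low surrogate appears on its own.
inline bool is_well_formed_utf16(const std::vector<char16_t>& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {             // high surrogate
            if (i + 1 == s.size()) return false;      // nothing follows it
            char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;                                      // skip the low surrogate
        } else if (u >= 0xDC00 && u <= 0xDFFF) {      // stray low surrogate
            return false;
        }
    }
    return true;
}

// Hypothetical string class that hands out its storage by reference.
class leaky_unicode_string {
public:
    explicit leaky_unicode_string(std::vector<char16_t> units)
        : units_(std::move(units)) {}

    // The accessor under discussion: callers can now edit code units freely.
    std::vector<char16_t>& storage() { return units_; }

    bool invariant_holds() const { return is_well_formed_utf16(units_); }

private:
    std::vector<char16_t> units_;
};

// Returns true if the invariant held before the edit and is broken after it.
inline bool demo_storage_breaks_invariant() {
    std::vector<char16_t> units{0xD83D, 0xDE00};      // U+1F600 as a surrogate pair
    leaky_unicode_string s(units);
    bool before = s.invariant_holds();
    s.storage().pop_back();                           // leaves a lone high surrogate
    bool after = s.invariant_holds();
    return before && !after;
}
```

A normalization invariant can be broken the same way, for example by
inserting a combining mark through storage().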

>> The user should remember and honor the encoding (UTF-16, UCS-2,
>> other) of a particular container of char16_t, not the container
>> itself.
>
> So, vector<char16_t> could be either UTF-16 or UCS-2? I think that's
> a bad idea.

Maybe, but this is just the way things are. :-) A sequence of char16_t can
have any encoding.
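
As a rough illustration, the same vector<char16_t> gives different answers
depending on which encoding the reader assumes. The function names
count_as_ucs2 and count_as_utf16 are made up for this sketch, and the
UTF-16 reader simply counts an unpaired surrogate as one element rather
than rejecting it.

```cpp
#include <cstddef>
#include <vector>

// Reading the units as UCS-2: every 16-bit unit is taken as one code point.
inline std::size_t count_as_ucs2(const std::vector<char16_t>& units) {
    return units.size();
}

// Reading the same units as UTF-16: a high surrogate followed by a low
// surrogate is taken as a single code point.
inline std::size_t count_as_utf16(const std::vector<char16_t>& units) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < units.size(); ++i) {
        bool high = units[i] >= 0xD800 && units[i] <= 0xDBFF;
        bool low_follows = i + 1 < units.size() &&
                           units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;
        if (high && low_follows) ++i;   // consume the second half of the pair
        ++n;
    }
    return n;
}
```

For the units 0x0041, 0xD83D, 0xDE00 the first function reports three
elements and the second reports two. Nothing in the container type says
which reading is the intended one.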

> If a library accepts unicode string, then its interface can either:
> - use 'unicode_string'
> - use 'unicode_string<some_encoding>'
> - use 'vector<char16_t>' and have a comment that the string is UTF-16.
>
> I think the first option is best, and the last is too easy to misuse.

Yes.

So let's see if I understand your position correctly.

A single string class shall be used to store Unicode strings, i.e. logical
sequences of Unicode abstract characters.

This string shall be stored in one chosen encoding, for example UTF-8. The
user does not have direct access to the underlying storage, however, so it
might be regarded as an implementation detail.

An invariant of the string is that it is always in one chosen normalized
form. Iteration over the string gives back a sequence of char32_t abstract
characters. Comparisons are defined in terms of these sequences.
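
A minimal sketch of the design summarized above, under loud assumptions:
the name unicode_string_sketch is hypothetical, UTF-8 is picked only as
the example encoding, normalization is left out entirely, and the
constructor does not validate its input. A real design would presumably
expose an iterator rather than a code_points() copy.

```cpp
#include <cstddef>
#include <string>

// Hypothetical class: private UTF-8 storage, char32_t values on the outside.
class unicode_string_sketch {
public:
    // Assumes `cps` holds valid code points that are already normalized.
    explicit unicode_string_sketch(const std::u32string& cps) {
        for (char32_t cp : cps) append_utf8(cp);
    }

    // Decodes the private storage back into a sequence of char32_t.
    std::u32string code_points() const {
        std::u32string out;
        std::size_t i = 0;
        while (i < bytes_.size()) {
            unsigned char b = static_cast<unsigned char>(bytes_[i]);
            std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            // Keep only the payload bits of the lead byte.
            char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) |
                     (static_cast<unsigned char>(bytes_[i + k]) & 0x3F);
            out.push_back(cp);
            i += len;
        }
        return out;
    }

    // Comparison is defined on the char32_t sequences, not on the bytes.
    friend bool operator==(const unicode_string_sketch& a,
                           const unicode_string_sketch& b) {
        return a.code_points() == b.code_points();
    }

private:
    void append_utf8(char32_t cp) {
        if (cp < 0x80) {
            bytes_ += static_cast<char>(cp);
        } else if (cp < 0x800) {
            bytes_ += static_cast<char>(0xC0 | (cp >> 6));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            bytes_ += static_cast<char>(0xE0 | (cp >> 12));
            bytes_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            bytes_ += static_cast<char>(0xF0 | (cp >> 18));
            bytes_ += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            bytes_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    std::string bytes_;   // implementation detail, never exposed
};
```

Since the bytes are never handed out, the encoding could be changed to
UTF-16 without touching any caller.
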

Is this a fair summary?


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk