Boost :

From: Vladimir Prus (ghost_at_[hidden])
Date: 2004-10-20 08:00:57

Next message: Rogier van Dalen: "Re: [boost] Re: Any interest in adding unicode support to boost?"
Previous message: Joel: "Re: [boost] Re: Sprit-1.6 and latest boost"
In reply to: Peter Dimov: "Re: [boost] Re: Re: Any interest in adding unicode support to boost?"
Next in thread: Peter Dimov: "Re: [boost] Re: Re: Re: Any interest in adding unicode support to boost?"
Reply: Peter Dimov: "Re: [boost] Re: Re: Re: Any interest in adding unicode support to boost?"

Peter Dimov wrote:

> Vladimir Prus wrote:
>> Second question is if operator==, operator< or 'find' should operate
>> on vector<char_XX> or on abstract characters, using Unicode rules, or
>> there should be two versions. I don't really understand why
>> 'unicode-unaware' semantic is ever needed, so we should have only
>> 'unicode-aware' one.
>
> Look at 21.3/2: "The class template basic_string conforms to the
> requirements of a Sequence, as specified in (23.1.1).
>
> Additionally, because the iterators supported by basic_string are random
> access iterators (24.1.5), basic_string conforms to the requirements
> of a Reversible Container, as specified in (23.1)."
>
> Now look at Table 65, Container requirements, operator==:
>
> "== is an equivalence relation.
>
> a.size()==b.size() && equal(a.begin(), a.end(), b.begin())"

Yes, I know that.
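
To make the consequence of that requirement concrete, here is a minimal
sketch. It assumes the later C++11 types char16_t and std::u16string as a
stand-in for the hypothetical string16; the helper names (container_equal,
precomposed, decomposed) are made up for illustration. Under container
semantics, two canonically equivalent spellings of the same abstract
character compare unequal:

```cpp
#include <algorithm>
#include <string>

// Container-style equality, written out the way Table 65 words it:
// same size, and element-wise equal code units.
inline bool container_equal(const std::u16string& a, const std::u16string& b)
{
    return a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin());
}

// "e with acute accent" as a single precomposed code point, U+00E9.
inline std::u16string precomposed()
{
    return std::u16string(1, char16_t(0x00E9));
}

// The same abstract character as base letter U+0065 followed by
// combining acute accent U+0301.
inline std::u16string decomposed()
{
    std::u16string s;
    s += char16_t(0x0065);
    s += char16_t(0x0301);
    return s;
}
```

Here container_equal(precomposed(), decomposed()) is false, because the
sizes already differ (1 versus 2 code units). A 'unicode-aware' operator==
would have to normalize both sides before comparing.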

> The question is now, what do begin(), end() and size() return for our
> hypothetical string16?

The first two return some hypothetical iterator. Its operator* will return
either a "unicode_character" (base character + all accents) or a
"unicode_character_ref", which refers back to storage and extracts
components on demand. The performance of both versions is unclear. Further,
with "unicode_character_ref", the iterator won't be an lvalue iterator (or
a random access iterator in standard terms).

> I maintain that the library design is much cleaner if begin(), end() and
> size() are random access iterators over the underlying _storage_, not over
> the codepoint representation or abstract character representation.

But those methods allow working directly on storage, and I don't know if
that's ever needed, or more common than working at the character level. And
after all, unicode_string can have a 'storage()' method that returns
vector<char_16>&, if direct storage manipulation is desired.
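
A rough sketch of that shape, under stated assumptions: the class body and
the code_point_count() member are hypothetical, C++11 char16_t stands in for
char_16, the storage is taken to be well-formed UTF-16, and counting code
points is a simplification of the abstract-character level discussed above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical unicode_string: a character-level interface on top,
// with the raw code units available on request via storage().
class unicode_string
{
public:
    explicit unicode_string(const std::vector<char16_t>& units)
        : units_(units) {}

    // Direct access to the underlying UTF-16 code units.
    std::vector<char16_t>& storage() { return units_; }
    const std::vector<char16_t>& storage() const { return units_; }

    // Number of code points, not code units. Assumes well-formed UTF-16,
    // so every unit that is not a low (trailing) surrogate starts a new
    // code point.
    std::size_t code_point_count() const
    {
        std::size_t n = 0;
        for (std::size_t i = 0; i < units_.size(); ++i) {
            const bool low_surrogate =
                units_[i] >= 0xDC00 && units_[i] <= 0xDFFF;
            if (!low_surrogate) ++n;
        }
        return n;
    }

private:
    std::vector<char16_t> units_;
};
```

With storage holding U+0041 followed by the surrogate pair D83D DE00,
storage().size() is 3 while code_point_count() is 2, which is the gap
between the two levels of interface.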

> Codepoint iterators and abstract character iterators would still be
> provided, but they would be constant bidirectional with char32_t as the
> value_type.
>
> Codepoint and abstract character operations would be provided by
> algorithms, taking an iterator range.
>
> The user should remember and honor the encoding (UTF-16, UCS-2, other) of
> a particular container of char16_t, not the container itself.

So, vector<char16_t> could be either UTF-16 or UCS-2? I think that's a bad
idea.

If a library accepts a Unicode string, then its interface can either:
- use 'unicode_string'
- use 'unicode_string<some_encoding>'
- use 'vector<char16_t>' and have a comment that the string is UTF-16.

I think the first option is best, and the last is too easy to misuse.
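
For comparison, the second option can be sketched as follows. This is an
illustration only: the tag types, the encoded_string template (standing in
for 'unicode_string<some_encoding>') and utf16_unit_count are all
hypothetical names, and C++11 is assumed for char16_t and static_assert.
The point is that the encoding becomes part of the type, where a bare
vector<char16_t> carries it only in a comment.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical encoding tags.
struct utf16_tag {};
struct ucs2_tag {};

// Hypothetical string type that records its encoding in the type.
template <class Encoding>
class encoded_string
{
public:
    typedef Encoding encoding_type;

    explicit encoded_string(const std::vector<char16_t>& units)
        : units_(units) {}

    const std::vector<char16_t>& storage() const { return units_; }

private:
    std::vector<char16_t> units_;
};

// A library entry point that accepts UTF-16 and nothing else.
inline std::size_t utf16_unit_count(const encoded_string<utf16_tag>& s)
{
    return s.storage().size();
}

// A UCS-2 string does not silently convert to a UTF-16 one, so passing
// it to utf16_unit_count is a compile-time error.
static_assert(
    !std::is_convertible<encoded_string<ucs2_tag>,
                         encoded_string<utf16_tag> >::value,
    "encodings must not mix silently");
```

With plain vector<char16_t> the same mistake compiles without complaint,
which is why the third option is easy to misuse.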

- Volodya

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk