From: Rogier van Dalen (R.C.van.Dalen_at_[hidden])
Date: 2004-04-17 07:18:52


Anthony Williams wrote:

> "Reece Dunn" <msclrhd_at_[hidden]> writes:
>>[2] Basic Type And Iteration
>>
>>The basic representation is more complex, because now we are dealing with
>>character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode
>>string). At this stage, combining characters and marks should not be
>>concerned with, only complete characters.
>
>
> Here is the issue. What constitutes a complete character? At the lowest level,
> a single codepoint is a character. At the next level, a collection of
> codepoints (base+combining marks) is a character (e.g. e + acute accent is a
> single character). Sometimes there are many equivalent sequences of codepoints
> that constitute the same character. Sometimes there may be a single codepoint
> that is equivalent to a set of codepoints (e.g. e + acute accent => e-acute).
>
> At another level, a set of codepoints represents a glyph. This glyph may cover
> one or more characters. There may be several alternative glyphs for a single
> set of codepoints.

Yes, but a single codepoint may also have several glyphs. If your
definition of glyph is the same as mine (to me it is about graphics
rather than meaning), glyphs have nothing to do with Unicode text
handling, only with font rendering (AFAIK ICU deals with both).

[...]

>>I would also suggest that there be another iterator that operates on
>>std::pair< unicode_string::iterator, unicode_string::iterator > to group
>>combining marks, etc. Thus, there would also be a function
>>
>>unicode_string::utf32_t combine
>>(
>> std::pair< unicode_string::iterator, unicode_string::iterator > & ucr
>>)
>>
>>that will map the range into a single code point. You could therefore have a
>>combined_iterator that will provide access to the sequence of combined
>>characters.
>
>
> You cannot always map the sequence of codepoints that make up a character into
> a single codepoint.
>
> However, it is agreed that it would be nice to have a means of dealing with
> "character" chunks, independently of the number of codepoints that make up
> that character.

Yes. It seems to me that the discussion so far is about storage rather
than use of Unicode strings. One character may be defined by more than
one codepoint, and the different ways to define one character are
semantically equivalent (canonically equivalent; see the Unicode
standard, section 3.7). So U+00E0 ("a with grave") is equivalent to
U+0061 U+0300 ("a" followed by "combining grave"). I think characters in
this sense should be at the heart of a usable Unicode string.
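
To make "one character, several codepoints" concrete, here is a minimal
sketch that splits a codepoint sequence into character chunks. The names
utf32_t (as a typedef for a 32-bit integer), is_combining_mark and
character_ranges are my own assumptions for illustration, and the
predicate only knows the Combining Diacritical Marks block
(U+0300..U+036F). A real library would consult the Unicode Character
Database and the segmentation rules the standard defines.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumption: one 32-bit value per codepoint.
using utf32_t = std::uint32_t;

// Toy predicate (hypothetical): only U+0300..U+036F count as combining.
inline bool is_combining_mark(utf32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Returns half-open index ranges [first, second), each covering one
// codepoint plus every combining mark that directly follows it.
// A combining mark at the very start of the text forms its own range.
inline std::vector<std::pair<std::size_t, std::size_t>>
character_ranges(const std::vector<utf32_t>& text) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t j = i + 1;
        while (j < text.size() && is_combining_mark(text[j])) {
            ++j;
        }
        ranges.emplace_back(i, j);
        i = j;
    }
    return ranges;
}
```

For the sequence U+0061 U+0300 U+0062 U+00E0 this yields three ranges:
"a" plus combining grave, then "b", then the precomposed "a with grave".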

I would propose a class unicode_char, containing one or more codepoints
(e.g., in a vector<utf32_t>). operator==(unicode_char, unicode_char)
should return true for canonically equivalent sequences. A Unicode
string would be a basic_string-like container of unicode_chars. Functions
such as find_first_of would then have the expected behaviour.
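
A rough sketch of what I mean, under loud assumptions: the decomposition
table below is a toy with two entries, append_decomposed is a
hypothetical helper, and unicode_string is just a std::vector stand-in
for the basic_string-like container. A real implementation would need
the full Unicode decomposition data and would also have to reorder
combining marks by canonical combining class before comparing.

```cpp
#include <algorithm>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Assumption: one 32-bit value per codepoint.
using utf32_t = std::uint32_t;

// Toy canonical decomposition (hypothetical helper): knows two
// precomposed letters and passes every other codepoint through.
inline void append_decomposed(utf32_t cp, std::vector<utf32_t>& out) {
    switch (cp) {
        case 0x00E0:  // a with grave -> a + combining grave
            out.push_back(0x0061);
            out.push_back(0x0300);
            break;
        case 0x00E9:  // e with acute -> e + combining acute
            out.push_back(0x0065);
            out.push_back(0x0301);
            break;
        default:
            out.push_back(cp);
            break;
    }
}

class unicode_char {
public:
    // Stores the codepoints in decomposed form, so that equivalent
    // spellings covered by the toy table end up identical.
    unicode_char(std::initializer_list<utf32_t> cps) {
        for (utf32_t cp : cps) {
            append_decomposed(cp, decomposed_);
        }
    }

    friend bool operator==(const unicode_char& a, const unicode_char& b) {
        return a.decomposed_ == b.decomposed_;
    }
    friend bool operator!=(const unicode_char& a, const unicode_char& b) {
        return !(a == b);
    }

private:
    std::vector<utf32_t> decomposed_;
};

// Stand-in for the proposed basic_string-like container.
using unicode_string = std::vector<unicode_char>;
```

With this, unicode_char{0x00E0} compares equal to
unicode_char{0x0061, 0x0300}, and std::find over a unicode_string locates
a character whichever of the two spellings was stored.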

The implementation should probably be more optimised than one
allocation per character, but IMO a good Unicode library should
*transparently* deal with such things as canonical equivalence in all
operations, such as searching and deleting characters. unicode_string
should be as easy to use as basic_string.

Regards,
Rogier

