
From: Anthony Williams (anthony_w.geo_at_[hidden])
Date: 2004-04-19 04:16:54


Rogier van Dalen <R.C.van.Dalen_at_[hidden]> writes:

> Anthony Williams wrote:
>
>> At another level, a set of codepoints represents a glyph. This glyph may
>> cover one or more characters. There may be several alternative glyphs for a
>> single set of codepoints.
>
> Yes, but there may be more glyphs for one codepoint as well. If your
> definition of glyph is the same as mine (to me it has to do with graphics
> rather than meaning), glyphs have nothing to do with Unicode text handling,
> but rather with font drawing (AFAIK ICU deals with both).

Yes, but I was referring to more than just font differences. IIRC, there are
examples in Arabic where there are alternative representations of whole
words, so the rendering engine does more than just translate characters to
images: it may rearrange the characters, or treat groups of characters as a
single item.

But yes, in general it is beyond simple text handling, which is partly what I
meant by "At another level".

> [...]
>
>>>I would also suggest that there be another iterator that operates on
>>>std::pair< unicode_string::iterator, unicode_string::iterator > to group
>>>combining marks, etc. Thus, there would also be a function
>>>
>>>unicode_string::utf32_t combine
>>>(
>>> std::pair< unicode_string::iterator, unicode_string::iterator > & ucr
>>>)
>>>
>>>that will map the range into a single code point. You could therefore have
>>>a combined_iterator that will provide access to the sequence of combined
>>>characters.
>> You cannot always map the sequence of codepoints that make up a character
>> into a single codepoint. However, it is agreed that it would be nice to
>> have a means of dealing with "character" chunks, independently of the
>> number of codepoints that make up that character.
>
> Yes. It seems to me that the discussion so far is about storage, rather than
> use of Unicode strings. One character may be defined by more than one
> codepoint, and the different ways to define one character are semantically
> equivalent (canonically equivalent, see the Unicode standard, 3.7). So
> U+00E0 ("a with grave") is equivalent to U+0061 U+0300 ("a" "combining
> grave"). I think characters in this sense should be at the heart of a usable
> Unicode string.
>
> I would propose a class unicode_char, containing one or more codepoints
> (e.g., in a vector <utf32_t>). operator== (unicode_char, unicode_char)
> should return true for equivalent sequences. A Unicode string would be a
> basic_string-like container of unicode_char's. The find_first_of and such
> functions would then have the expected behaviour.

I think that actually storing character strings like that would be too slow,
but you would certainly want an interface that dealt with such constructs,
which is what I meant by '"character" chunks'.
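
To make the "character chunk" idea concrete, here is a rough sketch (not
proposed library code) of an interface that hands out one character at a
time as a range of codepoint iterators over a flat UTF-32 buffer, without
allocating a container per character. is_combining_mark is a placeholder
that only knows the Combining Diacritical Marks block; a real implementation
would consult the Unicode character database:

    #include <boost/cstdint.hpp>
    #include <utility>
    #include <vector>

    typedef boost::uint32_t utf32_t;
    typedef std::vector<utf32_t>::const_iterator cp_iterator;

    // Placeholder: only U+0300..U+036F is treated as combining here.
    inline bool is_combining_mark(utf32_t c)
    {
        return c >= 0x0300 && c <= 0x036F;
    }

    // Return the "character chunk" starting at pos as a half-open range
    // of codepoint iterators: the base codepoint plus trailing marks.
    inline std::pair<cp_iterator, cp_iterator>
    next_chunk(cp_iterator pos, cp_iterator end)
    {
        cp_iterator first = pos;
        if (pos != end)
            ++pos;                                  // base codepoint
        while (pos != end && is_combining_mark(*pos))
            ++pos;                                  // attached marks
        return std::make_pair(first, pos);
    }

Stepping through a string chunk by chunk is then just a matter of feeding
the second iterator of each result back in as the next starting position.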

> The implementation should probably be more optimised than requiring an
> allocation for every character, but IMO a good Unicode library should
> *transparently* deal with such things as canonical equivalence for all
> operations, like searching, deleting characters, etcetera. unicode_string
> should be as easy to use as basic_string.

Yes, ideally.
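
For what it's worth, "transparently" boils down to something like the
following toy sketch, where the one-entry decompose() stands in for full
canonical decomposition plus canonical reordering; this is roughly what
operator== or a search would have to do under the hood:

    #include <boost/cstdint.hpp>
    #include <cstddef>
    #include <vector>

    typedef boost::uint32_t utf32_t;

    // Toy decomposition: only U+00E0 ("a with grave") is known here; real
    // code would use the Unicode character database.
    inline std::vector<utf32_t> decompose(utf32_t c)
    {
        std::vector<utf32_t> result;
        if (c == 0x00E0) {
            result.push_back(0x0061);   // "a"
            result.push_back(0x0300);   // combining grave
        } else {
            result.push_back(c);
        }
        return result;
    }

    // Canonically equivalent sequences compare equal after decomposition.
    inline bool canonically_equal(const std::vector<utf32_t>& a,
                                  const std::vector<utf32_t>& b)
    {
        std::vector<utf32_t> da, db;
        for (std::size_t i = 0; i != a.size(); ++i) {
            const std::vector<utf32_t> d = decompose(a[i]);
            da.insert(da.end(), d.begin(), d.end());
        }
        for (std::size_t i = 0; i != b.size(); ++i) {
            const std::vector<utf32_t> d = decompose(b[i]);
            db.insert(db.end(), d.begin(), d.end());
        }
        return da == db;
    }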

Anthony

-- 
Anthony Williams
Senior Software Engineer, Beran Instruments Ltd.
