From: Rogier van Dalen (R.C.van.Dalen_at_[hidden])
Date: 2004-04-19 05:35:35


Anthony Williams wrote:
[..]
> Yes, but I was referring to more than just font differences. IIRC, there are
> examples in Arabic, where there are alternative representations of whole
> words, so the rendering engine does more than just translate characters to
> images, it may rearrange the characters, or treat groups of characters as a
> single item.
>
> But yes, in general it is beyond simple text handling, which is partly what I
> meant by "At another level".

Agreed. AFAIU, rearranging is done only for text rendering, so it is
not relevant to text handling at all, even complex text handling. The
same goes for right-to-left issues in mixed Latin/Arabic text. Please
correct me if I'm wrong.

[...]
>>I would propose a class unicode_char, containing one or more codepoints
>>(e.g., in a vector <utf32_t>). operator== (unicode_char, unicode_char)
>>should return true for equivalent sequences. A Unicode string would be a
>>basic_string-like container of unicode_char's. The find_first_of and such
>>functions would then have the expected behaviour.
>
>
> I think that actually storing character strings like that would be too slow,
> but you would certainly want an interface that dealt with such constructs,
> which is what I meant by '"character" chunks'.
>
>
>>The implementation should probably be more optimised than requiring an
>>allocation for every character, but IMO a good Unicode library should
>>*transparently* deal with such things as canonical equivalence for all
>>operations, like searching, deleting characters, etcetera. unicode_string
>>should be as easy to use as basic_string.
>
>
> Yes, ideally.

I would like to make my point slightly clearer than I did before. I
don't think a Unicode string library should concentrate on code
points. Yes, the raw Unicode data should be available somewhere, so it
can be written to a file or sent to the OS's display routines. However,
IMO the library should use characters as its *only* interface for
manipulation. It should discourage using code points directly, because
that leads to all kinds of errors that rarely appear when manipulating
English text but do appear for other languages. Think of such simple
examples as the equivalence of rôle (precomposed ô, U+00F4) and rôle
(o followed by the combining circumflex, U+0302) in different
normalisations.
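
To make the rôle example concrete, here is a minimal sketch of what I
have in mind. All names (code_point, unicode_char, to_characters and
the helpers) are hypothetical, not an existing interface. The
composition table is a toy that knows only o + U+0302, and
is_combining_mark assumes that only U+0300..U+036F are combining
marks. A real library would take both from the Unicode Character
Database and would also handle reordering of multiple marks.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical names throughout; this is a sketch, not a proposed API.
using code_point = char32_t;

// Toy canonical composition: knows only o + U+0302 -> U+00F4.
// Assumption: a real library would consult the Unicode data tables.
inline bool compose_pair(code_point base, code_point mark, code_point& out) {
    if (base == U'o' && mark == 0x0302) { out = 0x00F4; return true; }
    return false;
}

// Assumption: only the Combining Diacritical Marks block is recognised.
inline bool is_combining_mark(code_point cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// One user-perceived character: a base code point plus its marks.
struct unicode_char {
    std::vector<code_point> code_points;
};

// Compose adjacent pairs where the toy table allows it.
inline std::vector<code_point> composed(const unicode_char& c) {
    std::vector<code_point> result;
    for (code_point cp : c.code_points) {
        code_point merged = 0;
        if (!result.empty() && compose_pair(result.back(), cp, merged))
            result.back() = merged;
        else
            result.push_back(cp);
    }
    return result;
}

// Characters compare equal when their composed forms match.
inline bool operator==(const unicode_char& a, const unicode_char& b) {
    return composed(a) == composed(b);
}

// Group code points into characters: each combining mark is attached
// to the character that precedes it.
inline std::vector<unicode_char> to_characters(const std::u32string& text) {
    std::vector<unicode_char> chars;
    for (code_point cp : text) {
        if (chars.empty() || !is_combining_mark(cp))
            chars.push_back(unicode_char{});
        chars.back().code_points.push_back(cp);
    }
    return chars;
}

// Usage: the two spellings differ as code points, match as characters.
inline void check_role_example() {
    const std::u32string precomposed = U"r\u00F4le";   // ô as U+00F4
    const std::u32string decomposed  = U"ro\u0302le";  // o, then U+0302
    assert(precomposed != decomposed);
    assert(to_characters(precomposed) == to_characters(decomposed));
}
```

Comparing the raw code point strings reports a difference, while the
character-level comparison does not. That is why I would want
find_first_of and friends to work on unicode_char rather than on code
points.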

Regards,
Rogier

