From: Jonathan Biggar (jon_at_[hidden])
Date: 2005-03-19 11:57:55


Rogier van Dalen wrote:
>>Be careful with making a global assertion. Different users of a Unicode
>>library will need to access the data at different levels. Some will
>>need the raw encoding bytes or words, some will need code points, and
>>some will need 'grapheme clusters'.
>>
>>The library should support working at the level that each particular
>>user needs, and different parts of an application or library may need to
>>work at multiple levels.
>
> A decision must be made. Certainly you should have access to code
> points; and you should be able to work at multiple levels. However,
> one level has to be the default level. Most programmers should be able
> to get what they want by using boost::unicode_string (or whatever it's
> going to be called). We need to make a "global assertion" that's
> correct 99% of the time.

I don't see why there has to be a "default" interface at all. There
should just be multiple interfaces, one for each level that a programmer
may need to work at.
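
To make the levels concrete, here is a rough Python sketch of one string
seen at each of them. The function names are made up for illustration and
are not a proposed Boost interface, and the cluster function is a crude
approximation (it only attaches combining marks to the preceding code
point) rather than the full grapheme cluster rules of UAX #29.

```python
import unicodedata


def utf8_bytes(text: str) -> bytes:
    """Raw encoding level: the UTF-8 bytes of the string."""
    return text.encode("utf-8")


def utf16_code_units(text: str) -> list[int]:
    """Raw encoding level: the 16-bit UTF-16 code units of the string."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_points(text: str) -> list[int]:
    """Code point level: one integer per Unicode scalar value."""
    return [ord(ch) for ch in text]


def approximate_clusters(text: str) -> list[str]:
    """Approximate 'grapheme cluster' level (hypothetical helper).

    Assumption: a cluster is a code point followed by any combining marks.
    This ignores Hangul syllables, emoji sequences, and so on.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)
    return clusters


# 'j' + COMBINING CIRCUMFLEX ACCENT + MUSICAL SYMBOL G CLEF (outside the BMP)
sample = "j\u0302\U0001D11E"

print(len(utf8_bytes(sample)))            # 7 bytes
print(len(utf16_code_units(sample)))      # 4 code units
print(len(code_points(sample)))           # 3 code points
print(len(approximate_clusters(sample)))  # 2 clusters
```

The same string has four different lengths depending on the level, which
is why I would rather see each level exposed explicitly.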

> I think we need an interface that will work for programmers that have
> no idea what the difference between a code point or a grapheme cluster
> is, and don't want to be bothered by the difference between
>
> U+0135 LATIN SMALL LETTER J WITH CIRCUMFLEX
> and
> U+006A LATIN SMALL LETTER J
> U+0302 COMBINING CIRCUMFLEX ACCENT

That's fine for *certain* uses. Other programs may need to distinguish
between the two, and need the ability to convert a Unicode string
between the form where all combining characters are combined and the
form where they are all separate, explicit code points. One way of
telling the library that you don't care about the difference is to
ensure that every string you use is canonicalized into the form that
makes your job easier.
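
As an illustration of canonicalizing up front, here is a small Python
sketch using the standard unicodedata module. The helper name
canonically_equal is my own, and the choice of NFD for the comparison is
arbitrary (NFC would work equally well).

```python
import unicodedata

composed = "\u0135"     # LATIN SMALL LETTER J WITH CIRCUMFLEX
decomposed = "j\u0302"  # LATIN SMALL LETTER J + COMBINING CIRCUMFLEX ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# At the code point level the two strings are different ...
print(composed == decomposed)                   # False

# ... but after canonicalization they are the same.
print(canonically_equal(composed, decomposed))  # True

# Converting between the two forms explicitly:
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

A program that normalizes every string on input never sees the
difference; a program that needs the distinction simply skips that step.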

Alternatively, the interface could provide the ability to set state bits
in the string that indicate whether you want to see the differences or not.

> String handling includes searching, comparing, for which the above
> should be equivalent. As a programmer, I don't want to be bothered
> with different sequences that are canonically equivalent. I want it to
> just work. The library should handle the cases I didn't think about.

That's fine *when* you are working at that high a level of abstraction.

> Input and output has to deal with code points, obviously, but I think
> going from code points to what users think of as "characters" and vice
> versa for I/O should be done by the library. By default.
> I have not been able to find another use case for accessing code
> points directly. I'm ready to be convinced I'm wrong. However, we'll
> have to make a choice.

Another use case would be writing codeset conversion functions.
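
For example, a hand-written UTF-8 encoder has to consume individual code
points; grapheme clusters are of no use to it. The following is only a
sketch in Python with function names of my own choosing, not code from
any existing library.

```python
from typing import Iterable


def encode_code_point_utf8(cp: int) -> bytes:
    """Encode a single Unicode scalar value as UTF-8."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([         # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def code_points_to_utf8(cps: Iterable[int]) -> bytes:
    """Convert a sequence of code points to a UTF-8 byte string."""
    return b"".join(encode_code_point_utf8(cp) for cp in cps)


# The decomposed form from the example above, one code point at a time.
print(code_points_to_utf8([0x006A, 0x0302]) == "j\u0302".encode("utf-8"))  # True
```

Note that the converter must not merge U+006A and U+0302 behind the
caller's back; it has to pass each code point through exactly as given.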

-- 
Jonathan Biggar
jon_at_[hidden]