Boost :

From: Marshall Clow (marshall_at_[hidden])
Date: 2004-04-16 16:48:50


At 9:48 PM +0100 4/16/04, Reece Dunn wrote:
>Jeremy Maitin-Shepard wrote:
>>"Reece Dunn" writes:

[ big snip ]

>> > [2] Basic Type And Iteration
>
>> > The basic representation is more complex, because now we are dealing with
>> > character boundaries (when dealing with UTF-8 and UTF-16 views of a
>> > Unicode string). At this stage, we should not be concerned with combining
>> > characters and marks, only with complete characters.
>
>> > The Unicode string should provide at least 3 types of iterator, regardless
>> > of the internal representation (NOTE: as such, their implementation will
>> > depend on how the string is represented):
>> > * UTF-8 -- provides access to the UTF-8 representation of the string;
>> > * UTF-16 -- provides access to the UTF-16 representation of the string;
>> > * UTF-32 -- provides access to the Unicode character type.
>
>>This seems reasonable, although in practice the UTF-32/code-point
>>iterator would be the most likely to be used.
>
>Agreed, but the others would be useful: writing the string to a file
>as an example. This is why I suggest that the UTF-32 iterator is the
>default iterator (i.e. unicode_string::iterator is a UTF-32
>iterator).

[ more snipped ]

>> > RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16
>> > are multi-character encodings of UTF-32 (not considering combining marks
>> > at this stage), whereas UTF-32 is a single character encoding.

I'm pretty sure that this is a bad assumption.
You can't just ignore combining characters.

I believe that Miro posted an example of how, even using UTF-32, you
may not get a one-to-one mapping between a character and a single "entry".
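
To make the point concrete, here is a minimal sketch (not Miro's example, and
not part of any proposed interface). It assumes a C++11 compiler with
char32_t and std::u32string. The helper names is_combining_mark,
count_base_characters and check_examples are hypothetical, and the range
check covers only the Combining Diacritical Marks block (U+0300..U+036F),
where real code would consult the Unicode Character Database.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true only for code points in the Combining
// Diacritical Marks block (U+0300..U+036F). This is a simplification;
// Unicode defines combining marks in several other blocks as well.
bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Hypothetical helper: counts the code points that are not combining
// marks, so a base letter followed by its marks is counted once.
std::size_t count_base_characters(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t c : s) {
        if (!is_combining_mark(c)) {
            ++n;
        }
    }
    return n;
}

// The same visible character, "e with acute accent", stored two ways.
void check_examples() {
    const std::u32string precomposed = U"\u00E9";  // one code point
    const std::u32string decomposed = U"e\u0301";  // 'e' + combining acute

    // The UTF-32 "entry" counts differ...
    assert(precomposed.size() == 1);
    assert(decomposed.size() == 2);

    // ...although both strings hold a single character for the reader.
    assert(count_base_characters(precomposed) == 1);
    assert(count_base_characters(decomposed) == 1);
}
```

So an iterator over UTF-32 values steps over code points, not over what a
reader would call characters; the decomposed string above takes two steps to
cross one visible character.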

-- 
-- Marshall
Marshall Clow     Idio Software   <mailto:marshall_at_[hidden]>
I want a machine that thinks I'm more important than it is, and acts like it.
-- Eric Herrmann

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk