From: Anthony Williams (anthony_w.geo_at_[hidden])
Date: 2004-04-16 08:25:24


"Reece Dunn" <msclrhd_at_[hidden]> writes:

> [1] Storage And Representation
>
> The storage can easily be represented as a container type, and so we have:
>
> template
> <
>    typename CharT,
>    template< typename T, class A > class Container = std::vector,
>    class AllocT = std::allocator< CharT >
> >
> class string_storage: public Container< CharT, AllocT >
> {
> };

I am not sure this really gains us anything over just using the underlying
container directly.

> [2] Basic Type And Iteration
>
> The basic representation is more complex, because now we are dealing with
> character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode
> string). At this stage we should not be concerned with combining characters
> and marks, only with complete characters.

Here is the issue. What constitutes a complete character? At the lowest level,
a single codepoint is a character. At the next level, a collection of
codepoints (base+combining marks) is a character (e.g. e + acute accent is a
single character). Sometimes there are many equivalent sequences of codepoints
that constitute the same character. Sometimes there may be a single codepoint
that is equivalent to a set of codepoints (e.g. e + acute accent => e-acute).

At another level, a set of codepoints represents a glyph. This glyph may cover
one or more characters. There may be several alternative glyphs for a single
set of codepoints.
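
To make the "e + acute accent" example concrete, here is a minimal sketch of the two spellings as UTF-32 code point sequences. The function names are hypothetical and chosen only for illustration; nothing here is a proposed interface.

```cpp
#include <cassert>
#include <string>

// "e-acute" as one precomposed code point (U+00E9).
inline std::u32string precomposed_e_acute() { return U"\u00E9"; }

// "e-acute" as a base letter followed by U+0301 COMBINING ACUTE ACCENT.
inline std::u32string decomposed_e_acute() { return U"e\u0301"; }

// Checks that the two forms differ at the code point level.
inline void check_e_acute_forms() {
    const std::u32string a = precomposed_e_acute();
    const std::u32string b = decomposed_e_acute();
    assert(a.size() == 1);  // one code point
    assert(b.size() == 2);  // two code points, same user-perceived character
    assert(a != b);         // a plain code point comparison sees two different strings
}
```

A comparison that should treat these as the same character would need some form of normalization first, which this sketch does not attempt.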

> The Unicode string should provide at least 3 types of iterator, regardless
> of the internal representation (NOTE: as such, their implementation will
> depend on how the string is represented):

> * UTF-8 -- provides access to the UTF-8 representation of the string;
> * UTF-16 -- provides access to the UTF-16 representation of the string;
> * UTF-32 -- provides access to the Unicode character type.

I agree we need conversions to/from all 3 formats.

> Therefore, no matter what the representation, it should be possible to use
> the UTF-32 iterator variant and "see" the string in native Unicode; this
> should, therefore, be the standard iterator and the others should be used
> when converting between formats.

That is my POV.
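
As a rough illustration of what a UTF-32 view over UTF-8 storage has to do, here is a sketch of a decoder. The name utf8_to_utf32 is hypothetical, and the code assumes well-formed input: it does not reject overlong forms, surrogates, or stray continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes a UTF-8 byte string into UTF-32 code points.
// Assumption: the input is well-formed UTF-8; no validation is performed
// beyond a debug check that the final sequence is not truncated.
inline std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes that follow the lead byte
        if (lead < 0x80)      { cp = lead; }
        else if (lead < 0xE0) { cp = lead & 0x1Fu; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0Fu; extra = 2; }
        else                  { cp = lead & 0x07u; extra = 3; }
        assert(i + extra < in.size());  // sequence must not be truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(in[i + k]);
            cp = (cp << 6) | (cont & 0x3Fu);  // each continuation byte adds 6 bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

An iterator-based view would perform the same per-character step lazily instead of building a whole string.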

> NOTE: I am not well versed in how Unicode is represented, so I do not know
> how feasible it is to implement backwards traversal, but I do know that it
> would probably be wise to know the position of the last good end of a
> Unicode character (e.g. when dealing with multi-character UTF-8 and UTF-16
> representations).

Backwards traversal is generally possible, though with UTF-8 it is very slow,
as you don't know how many bytes to step back until you reach the beginning
of the character (though you know when you've got there).
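
A minimal sketch of that backward step, assuming well-formed UTF-8 held in a std::string (the helper names are hypothetical). Continuation bytes always have the bit pattern 10xxxxxx, which is how you know when you have reached the start of the character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Given the byte offset of a character start (or the end of the string),
// returns the offset at which the preceding character starts.
// Assumptions: `s` is well-formed UTF-8 and 0 < pos <= s.size().
inline std::size_t prev_char_start(const std::string& s, std::size_t pos) {
    assert(pos > 0 && pos <= s.size());
    do {
        --pos;  // step back one byte at a time
    } while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos])));
    return pos;
}
```

Each step back inspects between one and four bytes for well-formed input, since the length of the character is only known once its lead byte is found.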

> As a side note, it should be feasible to provide specialist wrappers around
> existing Unicode libraries (like Win32 (CharNext, etc.), ICU and libiconv?),

Agreed.

> so I would suggest having something akin to char_traits in basic_string.

I am not sure how that helps.

> RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are
> multi-character encodings of UTF-32 (not considering combining marks at this
> stage), whereas UTF-32 is a single character encoding.

Yes, that is why I believe we should use UTF-32 as the base (despite the
performance considerations others have raised).

> [3] Algorithms, Locales, etc.
>
> These are built upon the UTF-32 view of the Unicode string, like the string
> algorithms in the Boost library. Therefore, instead of str.find(
> unicode_string( "World" )), you would have find( str, unicode_string(
> "World" )).

I am not sure how non-member vs member makes any difference.

> I would also suggest that there be another iterator that operates on
> std::pair< unicode_string::iterator, unicode_string::iterator > to group
> combining marks, etc. Thus, there would also be a function
>
> unicode_string::utf32_t combine
> (
>    std::pair< unicode_string::iterator, unicode_string::iterator > & ucr
> )
>
> that will map the range into a single code point. You could therefore have a
> combined_iterator that will provide access to the sequence of combined
> characters.

You cannot always map the sequence of codepoints that make up a character into
a single codepoint.

However, it is agreed that it would be nice to have a means of dealing with
"character" chunks, independently of the number of codepoints that make up
that character.
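
One way to picture such "character" chunks is to group each base code point with the combining marks that follow it, without trying to collapse the group into a single code point. The sketch below is a deliberate simplification: it treats only U+0300..U+036F (the Combining Diacritical Marks block) as combining, whereas a real implementation would consult the Unicode character database. All names are hypothetical.

```cpp
#include <string>
#include <vector>

// Simplifying assumption: only the Combining Diacritical Marks block
// (U+0300..U+036F) counts as combining. Real data is far larger.
inline bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Splits a UTF-32 string into chunks of one base code point plus any
// combining marks that immediately follow it.
inline std::vector<std::u32string> split_into_characters(const std::u32string& s) {
    std::vector<std::u32string> chunks;
    for (char32_t cp : s) {
        if (chunks.empty() || !is_combining_mark(cp)) {
            chunks.push_back(std::u32string(1, cp));  // start a new chunk
        } else {
            chunks.back().push_back(cp);              // attach mark to current chunk
        }
    }
    return chunks;
}
```

With this, U"e\u0301a" yields two chunks, the first holding two code points and the second holding one, so the chunk count matches the number of characters a reader would see.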

Anthony

-- 
Anthony Williams
Senior Software Engineer, Beran Instruments Ltd.
