From: Anthony Williams (anthony_w.geo_at_[hidden])
Date: 2004-04-16 08:25:24

"Reece Dunn" <msclrhd_at_[hidden]> writes:

> [1] Storage And Representation
> The storage can easily be represented as a container type, and so we have:
> template
> <
> typename CharT,
> template< typename T, class A > class Container = std::vector,
> class AllocT = std::allocator< CharT >
> >
> class string_storage: public Container< CharT, AllocT >
> {
> };

I am not sure this really gains us anything over just using the underlying
container directly.

> [2] Basic Type And Iteration
> The basic representation is more complex, because now we are dealing with
> character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode
> string). At this stage, combining characters and marks need not be
> considered, only complete characters.

Here is the issue. What constitutes a complete character? At the lowest level,
a single codepoint is a character. At the next level, a collection of
codepoints (base+combining marks) is a character (e.g. e + acute accent is a
single character). Sometimes there are many equivalent sequences of codepoints
that constitute the same character. Sometimes there may be a single codepoint
that is equivalent to a set of codepoints (e.g. e + acute accent => e-acute).

At another level, a set of codepoints represents a glyph. This glyph may cover
one or more characters. There may be several alternative glyphs for a single
set of codepoints.
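
To make the codepoint level concrete, here is a small sketch (the function names are purely illustrative, not part of any proposed interface). It spells e-acute in two ways and shows that a plain codepoint-by-codepoint comparison treats them as different strings, even though a reader sees the same character:

```cpp
#include <string>

// Precomposed spelling: the single codepoint U+00E9.
inline std::u32string e_acute_precomposed() { return U"\u00E9"; }

// Decomposed spelling: U+0065 (e) followed by U+0301 (combining acute accent).
inline std::u32string e_acute_decomposed() { return U"e\u0301"; }

// Naive comparison: equal only if the codepoint sequences are identical.
inline bool same_codepoints(const std::u32string& a, const std::u32string& b) {
    return a == b;
}
```

Any comparison that should treat the two spellings as equal has to work above the codepoint level, for example by normalizing both strings first.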

> The Unicode string should provide at least 3 types of iterator, regardless
> of the internal representation (NOTE: as such, they will be implementation
> dependent on how the string is represented):

> * UTF-8 -- provides access to the UTF-8 representation of the string;
> * UTF-16 -- provides access to the UTF-16 representation of the string;
> * UTF-32 -- provides access to the Unicode character type.

I agree we need conversions to/from all 3 formats.
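
As a rough sketch of what the outward conversions involve, assuming the string is held as UTF-32 codepoints: the helpers below (hypothetical names, not an existing Boost interface) encode one codepoint into UTF-8 or UTF-16. They assume the input is a valid Unicode scalar value and do no error checking.

```cpp
#include <string>

// Append the UTF-8 encoding of cp to out.
// Assumes cp <= 0x10FFFF and cp is not a surrogate (0xD800..0xDFFF).
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Append the UTF-16 encoding of cp to out (same assumptions as above).
inline void append_utf16(std::u16string& out, char32_t cp) {
    if (cp < 0x10000) {                    // one code unit
        out += static_cast<char16_t>(cp);
    } else {                               // surrogate pair
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
}
```

UTF-8 and UTF-16 iterators over a UTF-32 representation could be built on exactly this kind of per-codepoint step.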

> Therefore, no matter what the representation, it should be possible to use
> the UTF-32 iterator variant and "see" the string in native Unicode; this
> should, therefore, be the standard iterator and the others should be used
> when converting between formats.

That is my POV.

> NOTE: I am not well versed in how Unicode is represented, so I do not know
> how feasible it is to implement backwards traversal, but I do know that it
> would probably be wise to know the position of the last good end of a
> Unicode character (e.g. when dealing with multi-character UTF-8 and UTF-16
> representations).

Backwards traversal is generally possible, though with UTF-8 it is slower
than forward traversal, as you don't know how many bytes the character
occupies until you reach its lead byte (though you know when you've got
there, because continuation bytes always have the form 10xxxxxx).
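
A minimal sketch of that backwards step, assuming the input is valid UTF-8 (the function names are made up for illustration). It walks back over continuation bytes, which always have the bit pattern 10xxxxxx, until it reaches a byte that starts a character:

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Given pos, the index just past a character, return the index of that
// character's first byte. Assumes s is valid UTF-8 and 0 < pos <= s.size().
inline std::size_t prev_char_start(const std::string& s, std::size_t pos) {
    do {
        --pos;
    } while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos])));
    return pos;
}
```

The length of the character is only known once the loop stops, which is the asymmetry with forward traversal, where the lead byte gives the length up front.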

> As a side note, it should be feasible to provide specialist wrappers around
> existing Unicode libraries (like Win32 (CharNext, etc.), ICU and libiconv?),
> so I would suggest having something akin to char_traits in basic_string.

I am not sure how that helps.

> RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are
> multi-character encodings of UTF-32 (not considering combining marks at this
> stage), whereas UTF-32 is a single character encoding.

Yes, that is why I believe we should use UTF-32 as the base (despite the
performance considerations others have raised).

> [3] Algorithms, Locales, etc.
> These are built upon the UTF-32 view of the Unicode string, like the string
> algorithms in the Boost library. Therefore, instead of str.find(
> unicode_string( "World" )), you would have find( str, unicode_string(
> "World" )).

I am not sure how non-member vs member makes any difference.

> I would also suggest that there be another iterator that operates on
> std::pair< unicode_string::iterator, unicode_string::iterator > to group
> combining marks, etc. Thus, there would also be a function
> unicode_string::utf32_t combine
> (
> std::pair< unicode_string::iterator, unicode_string::iterator > & ucr
> )
> that will map the range into a single code point. You could therefore have a
> combined_iterator that will provide access to the sequence of combined
> characters.

You cannot always map the sequence of codepoints that make up a character into
a single codepoint.
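
To illustrate, here is a sketch of a combine step that is allowed to fail. The two-entry table is a stand-in for real Unicode composition data, and the name try_compose is hypothetical. The point is only that the interface has to report "no single codepoint" rather than always return one:

```cpp
// Try to compose a base codepoint and one combining mark into a single
// precomposed codepoint. Returns true and sets out on success.
// Illustrative only: this tiny table covers just two pairs.
inline bool try_compose(char32_t base, char32_t mark, char32_t& out) {
    if (mark == 0x0301) {                              // combining acute accent
        if (base == U'e') { out = 0x00E9; return true; }   // e-acute
        if (base == U'a') { out = 0x00E1; return true; }   // a-acute
    }
    return false;  // no entry in this table for the pair
}
```

With complete data the failure case does not go away, so a combine() that returns a bare utf32_t cannot represent every combined character.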

However, it is agreed that it would be nice to have a means of dealing with
"character" chunks, independently of the number of codepoints that make up
that character.
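
One way such chunking might look, as a sketch under a deliberately simplified assumption: only codepoints in the Combining Diacritical Marks block (U+0300 to U+036F) are treated as combining marks. A real implementation would need the full Unicode character data. All names here are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplifying assumption: only U+0300..U+036F count as combining marks.
inline bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Split s into [begin, end) index ranges. Each range holds one leading
// codepoint plus any combining marks that immediately follow it.
inline std::vector<std::pair<std::size_t, std::size_t> >
character_chunks(const std::u32string& s) {
    std::vector<std::pair<std::size_t, std::size_t> > chunks;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t j = i + 1;
        while (j < s.size() && is_combining_mark(s[j])) {
            ++j;
        }
        chunks.push_back(std::make_pair(i, j));
        i = j;
    }
    return chunks;
}
```

Each chunk stays a range of codepoints rather than being collapsed into one value, so sequences with no precomposed form are handled the same way as those that have one.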


Anthony Williams
Senior Software Engineer, Beran Instruments Ltd.
