
From: Reece Dunn (msclrhd_at_[hidden])
Date: 2004-04-16 05:30:58

Here are my thoughts on Unicode strings, based partially on the current
discussions of the topic. As I understand it, the problem with strings
(standard character and Unicode strings) can be broken down into several
parts:

[1] Storage And Representation

This is how the underlying string is stored (allocation and memory mapping
policy) and how it is represented (which is governed by locale, but at this
stage, it is best to know the type: UTF-8, UTF-16, UTF-32, etc.)

The storage can easily be represented as a container type, and so we have:

template<
   typename CharT,
   template< typename T, class A > class Container = std::vector,
   class AllocT = std::allocator< CharT >
>
class string_storage: public Container< CharT, AllocT >

Here, I have chosen std::vector as the standard storage policy, as this
reflects the current storage policies; thus, basic_string< CharT, Traits >
would therefore be based on string_storage< CharT >.
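To make the storage policy concrete, here is a minimal sketch of how the declaration above might be fleshed out. The constructor and the code_unit_count helper are my own additions for illustration, not part of the proposal; the point is only that the container template can be swapped without touching the rest.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <vector>

// Storage policy sketch: the container template is itself a parameter, so
// the same class can sit on std::vector, std::deque, or any other template
// taking an (element type, allocator) pair.
template<
   typename CharT,
   template< typename T, class A > class Container = std::vector,
   class AllocT = std::allocator< CharT >
>
class string_storage: public Container< CharT, AllocT >
{
   typedef Container< CharT, AllocT > base_type;
public:
   string_storage() {}

   // Copy the code units in [first, last) into the storage.
   string_storage( const CharT * first, const CharT * last )
      : base_type( first, last )
   {
   }
};

// Hypothetical helper (not in the proposal): the storage layer counts
// code units only; it knows nothing about character boundaries.
template< typename Storage >
std::size_t code_unit_count( const Storage & s )
{
   return s.size();
}
```

With this, string_storage< char > uses std::vector and string_storage< char, std::deque > uses std::deque, with no other change to the calling code.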

It would be easy, then, to select other representations like a reference
counted storage (a variant of std::auto_ptr< std::vector >) and even an
SGI-like rope! (Although, this would mean that a new std::roped_vector class
would need to be implemented: does such a thing already exist in Boost?)

[2] Basic Type And Iteration

The basic representation is more complex, because now we are dealing with
character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode
string). At this stage, we should not be concerned with combining
characters and marks, only complete characters.

The Unicode string should provide at least 3 types of iterator, regardless
of the internal representation (NOTE: as such, their implementation will
depend on how the string is represented):
* UTF-8 -- provides access to the UTF-8 representation of the string;
* UTF-16 -- provides access to the UTF-16 representation of the string;
* UTF-32 -- provides access to the Unicode character type.

Therefore, no matter what the representation, it should be possible to use
the UTF-32 iterator variant and "see" the string in native Unicode; this
should, therefore, be the standard iterator and the others should be used
when converting between formats.
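As a rough illustration of the UTF-32 view over a different internal representation, the sketch below decodes UTF-8 storage into code points. The names next_code_point and utf32_view are hypothetical, and the code assumes the input is already well-formed UTF-8 (no validation of overlong forms, surrogates or truncated sequences), so treat it as a starting point rather than a finished decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode the code point that starts at s[pos] and advance pos past it.
// Assumption: s holds well-formed UTF-8; nothing is validated here.
inline std::uint32_t next_code_point( const std::string & s, std::size_t & pos )
{
   assert( pos < s.size() );
   const unsigned char lead = static_cast< unsigned char >( s[pos++] );
   if ( lead < 0x80 ) return lead;                               // 0xxxxxxx

   int trailing;
   std::uint32_t cp;
   if ( lead >= 0xF0 )      { trailing = 3; cp = lead & 0x07; } // 11110xxx
   else if ( lead >= 0xE0 ) { trailing = 2; cp = lead & 0x0F; } // 1110xxxx
   else                     { trailing = 1; cp = lead & 0x1F; } // 110xxxxx

   // Each continuation byte (10xxxxxx) contributes six more bits.
   for ( ; trailing > 0; --trailing )
   {
      assert( pos < s.size() );
      cp = ( cp << 6 ) | ( static_cast< unsigned char >( s[pos++] ) & 0x3F );
   }
   return cp;
}

// Materialise the whole UTF-32 view; a real iterator would decode lazily.
inline std::vector< std::uint32_t > utf32_view( const std::string & utf8 )
{
   std::vector< std::uint32_t > out;
   for ( std::size_t pos = 0; pos < utf8.size(); )
      out.push_back( next_code_point( utf8, pos ) );
   return out;
}
```

A UTF-32 iterator over UTF-8 storage would do the same work one step at a time in operator++, instead of building a vector up front.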

NOTE: I am not well versed in how Unicode is represented, so I do not know
how feasible it is to implement backwards traversal, but I do know that it
would probably be wise to know the position of the last good end of a
Unicode character (e.g. when dealing with multi-character UTF-8 and UTF-16
sequences).
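On backwards traversal: for UTF-8 at least, it looks feasible without remembering the last good position, because continuation bytes always have the bit pattern 10xxxxxx and so can be told apart from lead bytes. A minimal sketch, assuming well-formed input (previous_boundary and is_continuation are hypothetical names):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which always look like 10xxxxxx.
inline bool is_continuation( char c )
{
   return ( static_cast< unsigned char >( c ) & 0xC0 ) == 0x80;
}

// Given pos on a character boundary (or equal to s.size()), return the
// index at which the previous code point starts.
// Assumption: s holds well-formed UTF-8 and pos > 0.
inline std::size_t previous_boundary( const std::string & s, std::size_t pos )
{
   assert( pos > 0 && pos <= s.size() );
   do
   {
      --pos;
   }
   while ( pos > 0 && is_continuation( s[pos] ) );
   return pos;
}
```

The UTF-16 case is similar: a code unit in the range 0xDC00 to 0xDFFF is the second half of a surrogate pair, so stepping back one more unit reaches the start of the character.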

As a side note, it should be feasible to provide specialist wrappers around
existing Unicode libraries (like Win32 (CharNext, etc.), ICU and libiconv?),
so I would suggest having something akin to char_traits in basic_string.

RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are
variable-width encodings that may need several code units for one code point
(not considering combining marks at this stage), whereas UTF-32 always uses
exactly one code unit per code point.

[3] Algorithms, Locales, etc.

These are built upon the UTF-32 view of the Unicode string, like the string
algorithms in the Boost library. Therefore, instead of str.find(
unicode_string( "World" )), you would have find( str, unicode_string(
"World" )).
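A minimal sketch of what such a free-function find could look like, assuming the UTF-32 view is available as a plain sequence of code points. The code_points typedef stands in for unicode_string here, and from_ascii and npos are hypothetical helpers added only so the example is self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch
{
   // Stand-in for the UTF-32 view of a unicode_string.
   typedef std::vector< std::uint32_t > code_points;

   // Returned when the search fails.
   const std::size_t npos = static_cast< std::size_t >( -1 );

   // Build a code point sequence from an ASCII literal (illustration only).
   inline code_points from_ascii( const char * s )
   {
      code_points r;
      for ( ; *s; ++s )
         r.push_back( static_cast< unsigned char >( *s ) );
      return r;
   }

   // Free-function find: index, counted in code points, of the first
   // occurrence of what in str, or npos if there is none.
   inline std::size_t find( const code_points & str, const code_points & what )
   {
      if ( what.empty() ) return 0;
      code_points::const_iterator it =
         std::search( str.begin(), str.end(), what.begin(), what.end() );
      return it == str.end()
         ? npos
         : static_cast< std::size_t >( it - str.begin() );
   }
}
```

Because the algorithm only sees code points, the same find works no matter how the string is stored underneath.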

I would also suggest that there be another iterator that operates on
std::pair< unicode_string::iterator, unicode_string::iterator > to group
combining marks, etc. Thus, there would also be a function

unicode_string::utf32_t combine
(
   std::pair< unicode_string::iterator, unicode_string::iterator > & ucr
);

that will map the range into a single code point. You could therefore have a
combined_iterator that will provide access to the sequence of combined
characters.


If ucr.first == ucr.second, then combine( ucr ) = *( ucr.first ).
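To show the shape of combine, here is a rough sketch under several assumptions of mine: the range is inclusive (which is how I read the first == second case above), plain pointers stand in for unicode_string::iterator, the parameter is taken by const reference, and the composition table holds three hand-picked entries for illustration. A real table would have to come from the Unicode data files.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

typedef std::uint32_t utf32_t;
typedef const utf32_t * cp_iterator; // stand-in for unicode_string::iterator

struct composition
{
   utf32_t base, mark, composed;
};

// Illustrative entries only; not a complete composition table.
static const composition table[] =
{
   { 0x0061, 0x0308, 0x00E4 }, // a + combining diaeresis
   { 0x0065, 0x0301, 0x00E9 }, // e + combining acute accent
   { 0x006E, 0x0303, 0x00F1 }  // n + combining tilde
};

// Map the inclusive range [ucr.first, ucr.second] to a single code point.
// If first == second the range holds one code point, returned unchanged.
inline utf32_t combine( const std::pair< cp_iterator, cp_iterator > & ucr )
{
   utf32_t result = *ucr.first;
   for ( cp_iterator it = ucr.first; it != ucr.second; )
   {
      ++it;
      bool found = false;
      for ( std::size_t i = 0; i < sizeof table / sizeof table[0]; ++i )
      {
         if ( table[i].base == result && table[i].mark == *it )
         {
            result = table[i].composed;
            found = true;
            break;
         }
      }
      // No entry for this pair: the sketch stops and keeps what it has.
      if ( !found ) break;
   }
   return result;
}
```

A combined_iterator would then find each base-plus-marks range and hand it to combine when dereferenced.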

