
From: Reece Dunn (msclrhd_at_[hidden])
Date: 2004-04-16 15:48:11

Jeremy Maitin-Shepard wrote:
>"Reece Dunn" writes:
> > [1] Storage And Representation

> > This is how the underlying string is stored (allocation and memory
> > policy) and how it is represented (which is governed by locale, but at
> > this stage, it is best to know the type: UTF-8, UTF-16, UTF-32, etc.)

>I am not sure what you mean by "governed by locale." I believe the

What I meant was that things like character identification, upper/lower case
conversion, etc. would be handled by the locale, although I did not express
this very well. I know there are issues associated with using locales, and I
have too little experience with them to comment further.

> > The storage can easily be represented as a container type, and so we
> > template
> > <
> > typename CharT,
> > template< typename T, class A > class Container = std::vector,
> > class AllocT = std::allocator< CharT >
> >>
> > class string_storage: public Container< CharT, AllocT >
> > {
> > };
> > Here, I have chosen std::vector as the standard storage policy, as this
> > matches the current storage policy; thus, basic_string< CharT, Traits >
> > would therefore be based on string_storage< CharT >.
>I do like this idea, although I think something like this might be
>template <encoding enc, class Container = ...> class unicode_string;
>"encoding" would be an enumeration type, such that enc would specify
>one of UTF-8, UTF-16, or UTF-32. This is, I would say, a more explicit
>way to specify the encoding than relying on the size of the type
>specified, and also it avoids problems in cases where the platform does
>not have distinct types for UTF-16 and UTF-32 code units (unlikely, I

That would be a better idea. You would need something like:

template< encoding >
class encoding_type{};

template<>
class encoding_type< utf8_enc >{ public: typedef char type; };
// ...

template<
   encoding enc,
   template< typename T, class A = std::allocator< T > > class Container = std::vector
>
class unicode_string: public Container< typename encoding_type< enc >::type >
{
};

>One issue which would need to be dealt with is that while it seems
>necessary for some containers, such as a rope, to have direct access to
>the container, publicly inheriting this unicode_string from the
>container type means that the additions to the interface must be more

I don't get this. Surely the roped_vector, or whatever rope-like container
is used, will have an STL container interface like std::vector, so you could
use them interchangeably. The unicode_string facilities would make use of
the insert/append functions, iterators, etc. of the storage container to
implement their specific facilities.

Thus, all the rope internals would be handled by roped_vector (or whatever
the rope container is called), allowing you to use it like a std::vector, so
the user of the container would be removed from the internals. This is the
idea of having a Container as a template in the first place.

> > [2] Basic Type And Iteration

> > The basic representation is more complex, because now we are dealing
> > character boundaries (when dealing with UTF-8 and UTF-16 views of a
> > string). At this stage, combining characters and marks should not be
> > with, only complete characters.

> > The Unicode string should provide at least 3 types of iterator,
>regardless of
> > the internal representation (NOTE: as such, they will be implementation
> > dependant on how the string is represented):
> > * UTF-8 -- provides access to the UTF-8 representation of the string;
> > * UTF-16 -- provides access to the UTF-16 representation of the string;
> > * UTF-32 -- provides access to the Unicode character type.

>This seems reasonable, although in practice the UTF-32/code-point
>iterator would be the most likely to be used.

Agreed, but the others would be useful: writing the string to a file, for
example. This is why I suggest that the UTF-32 iterator is the default
iterator (i.e. unicode_string::iterator is a UTF-32 iterator).

> > As a side note, it should be feasible to provide specialist wrappers for
> > existing Unicode libraries (like Win32 (CharNext, etc.), ICU and
>libiconv?), so
> > I would suggest having something akin to char_traits in basic_string.

>I would say there is not much need to provide "specialist wrappers"
>over other libraries. Presumably, a lot of a Boost Unicode library
>could use code from ICU, but there is no advantage in attempting to use
>platform-specific facilities, and doing so would surely introduce
>inefficiency and great complication.

Okay. It was just an idea :)

> > RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16
> > are multi-character encodings of UTF-32 (not considering combining marks
> > at this stage), whereas UTF-32 is a single character encoding.

>Yes, most processing would at the very least need to internally use
>this code-point iterator.

>There is a significant advantage, however, in standardizing on a single
>POD-array representation as well, such that much of the library need not
>be templated (and thus implemented largely in header files, and thus
>compiled on each use), and less efficient functions could also be provided
>which allow arbitrary iterators through runtime polymorphism. I think
>it will be particularly important to examine this trade-off, because
>Unicode support involves a large number of heavy-weight facilities.

Agreed. However, this is contradictory to allowing the user to specify the
container used for string storage.

Maybe we could have a templatized version for users that want a custom
storage policy, like a rope, and a fixed representation (UTF-16?) for those
that are not bothered about how the Unicode string is stored. The interfaces
of these should be the same, to allow the higher-level facilities to
interoperate with both representations.

> > [3] Algorithms, Locales, etc.
> > These are built upon the UTF-32 view of the Unicode string, like the
> > string algorithms in the Boost library. Therefore, instead of
> > str.find( "World" ), you would have find( str, unicode_string( "World" )).
>Well, except that you would want the strength, etc. to be adjustable,
>and of course localized, and string literals pose additional problems...

The logic behind this was for unicode_string to deal with navigating through
and mapping to the internal representation. The find functions, etc. could
then be implemented by iterating over the UTF-32 iterators, and could be
done as templates, as in the string algorithms.

> > I would also suggest that there be another iterator that operates on
> > std::pair< unicode_string::iterator, unicode_string::iterator > to group
> > combining characters, etc.
>These code point groups are referred to as grapheme clusters, and I
>certainly agree that it is necessary to provide an iterator interface
>to grapheme clusters. I would not suggest, however, that normalization
>be integrated into that interface, because only a small portion of the
>possible grapheme clusters can be normalized into a single code point,
>and I don't think it is a particularly common operation to do so,
>especially for only a single grapheme cluster.

That makes sense. Going further into grapheme clusters would be too
complicated for a generic Unicode library, as you would then need to
consider how to map the cluster onto the appropriate font: this would be
platform specific and far too complex (e.g. overlaying combining marks).


