From: Reece Dunn (msclrhd_at_[hidden])
Date: 2004-04-16 15:48:11


Jeremy Maitin-Shepard wrote:
>"Reece Dunn" writes:
> > [1] Storage And Representation

> > This is how the underlying string is stored (allocation and memory mapping
> > policy) and how it is represented (which is governed by locale, but at this
> > stage, it is best to know the type: UTF-8, UTF-16, UTF-32, etc.)

>I am not sure what you mean by "governed by locale." I believe the

What I meant was that things like character identification, upper/lower case
conversion, etc. would be in the locale, although I did not express this too
well. I know there are issues associated with using locales, but I have too
little experience with them to comment.

> > The storage can easily be represented as a container type, and so we have:
>
> > template
> > <
> >    typename CharT,
> >    template< typename T, class A > class Container = std::vector,
> >    class AllocT = std::allocator< CharT >
> > >
> > class string_storage: public Container< CharT, AllocT >
> > {
> > };
>
> > Here, I have chosen std::vector as the standard storage policy, as this
> > reflects the current storage policies; thus, basic_string< CharT, Traits >
> > would therefore be based on string_storage< CharT >.
>
>I do like this idea, although I think something like this might be
>better:
>
>template <encoding enc, class Container = ...> class unicode_string;
>
>"encoding" would be an enumeration type, such that enc would specify
>one of UTF-8, UTF-16, or UTF-32. This is, I would say, a more explicit
>way to specify the encoding than relying on the size of the type
>specified, and also it avoids problems in cases where the platform does
>not have distinct types for UTF-16 and UTF-32 code units (unlikely, I
>admit).

That would be a better idea. You would need something like:

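// "encoding" is the enumeration suggested above; utf8_enc is one of its values.
template< int >
struct encoding_type{};

// Map each encoding to its code unit type (struct, so "type" is public).
template<>
struct encoding_type< utf8_enc >{ typedef char type; };
// ...

template
<
   encoding enc,
   template< typename T, class A = std::allocator< T > > class Container =
      std::vector
>
class unicode_string: public Container< typename encoding_type< enc >::type >
{
   ...
};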

>One issue which would need to be dealt with is that while it seems
>necessary for some containers, such as a rope, to have direct access to
>the container, publicly inheriting this unicode_string from the
>container type means that the additions to the interface must be more
>limited.

I don't get this. Surely the roped_vector, or whatever rope-like container
is used, will have an STL container interface like std::vector, so you could
use them interchangeably. The unicode_string facilities would make use of
the insert/append functions, iterators, etc. of the storage container to
implement their specific facilities.

Thus, all the rope internals would be handled by roped_vector (or whatever
the rope container is called), allowing you to use it like a std::vector, so
the user of the container would be insulated from the internals. This is the
idea of having Container as a template parameter in the first place.
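
As a rough illustration of that interchangeability, here is a minimal sketch.
The names code_unit_string and append_units are hypothetical, and std::deque
only stands in for a rope-like container with an STL interface; nothing here
is a proposed API.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <vector>

// Hypothetical string type layered on any STL-style sequence container.
// Only the container's public interface (insert, end) is used, so the
// container's internals stay hidden from the string code.
template
<
   typename CharT,
   template< typename T, class A > class Container = std::vector
>
class code_unit_string: public Container< CharT, std::allocator< CharT > >
{
public:
   // Append a NUL-terminated run of code units via the container's insert().
   void append_units( const CharT * s )
   {
      const CharT * e = s;
      while( *e ) ++e;
      this->insert( this->end(), s, e );
   }
};
```

The same member function body works for code_unit_string< char > and
code_unit_string< char, std::deque >, because it never touches anything
beyond the shared container interface.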

> > [2] Basic Type And Iteration

> > The basic representation is more complex, because now we are dealing with
> > character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode
> > string). At this stage, we should not be concerned with combining
> > characters and marks, only with complete characters.

> > The Unicode string should provide at least 3 types of iterator, regardless
> > of the internal representation (NOTE: as such, they will be implementation
> > dependent on how the string is represented):
> > * UTF-8 -- provides access to the UTF-8 representation of the string;
> > * UTF-16 -- provides access to the UTF-16 representation of the string;
> > * UTF-32 -- provides access to the Unicode character type.

>This seems reasonable, although in practice the UTF-32/code-point
>iterator would be the most likely to be used.

Agreed, but the others would be useful, for example when writing the string
to a file. This is why I suggest that the UTF-32 iterator be the default
iterator (i.e. unicode_string::iterator is a UTF-32 iterator).
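
To make the UTF-32 view over UTF-8 storage concrete, here is a minimal
sketch of the decoding step such an iterator would perform. The names
next_code_point and to_utf32 are hypothetical, and the code assumes
well-formed UTF-8: it does no validation of malformed or overlong sequences.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decode one code point from UTF-8 and advance `it` past it.
// Assumption: the input is well-formed UTF-8 (no error handling).
template< typename It >
std::uint32_t next_code_point( It & it, It end )
{
   unsigned char lead = static_cast< unsigned char >( *it++ );
   std::uint32_t cp;
   int trailing;
   if( lead < 0x80 )      { cp = lead;        trailing = 0; } // 1-byte form
   else if( lead < 0xE0 ) { cp = lead & 0x1F; trailing = 1; } // 2-byte form
   else if( lead < 0xF0 ) { cp = lead & 0x0F; trailing = 2; } // 3-byte form
   else                   { cp = lead & 0x07; trailing = 3; } // 4-byte form
   // Each continuation byte contributes its low six bits.
   for( ; trailing > 0 && it != end; --trailing )
      cp = ( cp << 6 ) | ( static_cast< unsigned char >( *it++ ) & 0x3F );
   return cp;
}

// Collect the UTF-32 view of a UTF-8 encoded buffer.
inline std::vector< std::uint32_t > to_utf32( const std::string & utf8 )
{
   std::vector< std::uint32_t > out;
   std::string::const_iterator it = utf8.begin();
   while( it != utf8.end() )
      out.push_back( next_code_point( it, utf8.end() ) );
   return out;
}
```

A real code-point iterator would wrap next_code_point behind operator* and
operator++, and would also need a policy for invalid input.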

> > As a side note, it should be feasible to provide specialist wrappers around
> > existing Unicode libraries (like Win32 (CharNext, etc.), ICU and libiconv?),
> > so I would suggest having something akin to char_traits in basic_string.

>I would say there is not much need to provide "specialist wrappers"
>over other libraries. Presumably, a lot of a Boost Unicode library
>could use code from ICU, but there is no advantage in attempting to use
>platform-specific facilities, and doing so would surely introduce
>inefficiency and great complication.

Okay. It was just an idea :)

> > RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are
> > multi-character encodings of UTF-32 (not considering combining marks at this
> > stage), whereas UTF-32 is a single character encoding.

>Yes, most processing would at the very least need to internally use
>this code-point iterator.

>There is a significant advantage, however, in standardizing on a single
>POD-array representation as well, such that much of the library need not
>be templated (and thus implemented largely in header files, and thus
>compiled each use), and less efficient functions could also be provided
>which allow arbitrary iterators through runtime polymorphism. I think
>it will be particularly important to examine this trade-off, because
>Unicode support involves a large number of heavy-weight facilities.

Agreed. However, this conflicts with allowing the user to specify the
container used for string storage.

Maybe there could be a templatized version for users who want a custom
storage policy, like a rope, and a fixed representation (UTF-16?) for those
who are not concerned with how the Unicode string is stored. Both should
share the same interface, so that the higher-level facilities interoperate
with either representation.

> > [3] Algorithms, Locales, etc.
>
> > These are built upon the UTF-32 view of the Unicode string, like the string
> > algorithms in the Boost library. Therefore, instead of
> > str.find( unicode_string( "World" )), you would have
> > find( str, unicode_string( "World" )).
>
>Well, except that you would want the strength, etc. to be adjustable,
>and of course localized, and string literals pose additional problems...

The logic behind this was for unicode_string to deal with navigating through
the internal representation and mapping to the internal representation. The
find functions, etc. could then be implemented by iterating over the UTF-32
iterators and could be done as templates, like the string algorithms.
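
A minimal sketch of such a free-function find, assuming the string exposes
its code points through begin()/end(). Here a std::vector of code points
stands in for unicode_string, and the sketch namespace and from_ascii helper
are hypothetical. It does plain code-point matching only: no collation
strength and no locale, which are exactly the issues raised above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

namespace sketch
{
   // Free-function find over any container of code points; it relies only
   // on begin()/end(), not on how the string is stored internally.
   template< typename String >
   typename String::const_iterator find( const String & str, const String & what )
   {
      return std::search( str.begin(), str.end(), what.begin(), what.end() );
   }

   // Helper for the example: widen an ASCII literal to code points.
   inline std::vector< std::uint32_t > from_ascii( const char * s )
   {
      std::vector< std::uint32_t > out;
      for( ; *s; ++s )
         out.push_back( static_cast< unsigned char >( *s ) );
      return out;
   }
}
```

Called as sketch::find( str, sketch::from_ascii( "World" )), it returns an
iterator to the first match, or str.end() if there is none.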

> > I would also suggest that there be another iterator that operates on
> > std::pair< unicode_string::iterator, unicode_string::iterator > to group
> > combining marks, etc.
>
>These code point groups are referred to as grapheme clusters, and I
>certainly agree that it is necessary to provide an iterator interface
>to grapheme clusters. I would not suggest, however, that normalization
>be integrated into that interface, because only a small portion of the
>possible grapheme clusters can be normalized into a single code point,
>and I don't think it is a particularly common operation to do so,
>especially for only a single grapheme cluster.

That makes sense. Going further into grapheme clusters would be too
complicated for a generic Unicode library, as you would then need to
consider how to map the cluster into the appropriate font: this would be
platform specific and far too complex (e.g. overlaying combining marks,
etc.)!
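
For the iterator-pair idea itself, a minimal sketch follows. It assumes,
purely for illustration, that only U+0300..U+036F (the Combining Diacritical
Marks block) count as combining; real grapheme cluster boundaries are defined
by Unicode Standard Annex #29 and are considerably more involved. The names
is_combining and next_cluster are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Simplifying assumption: only the Combining Diacritical Marks block
// (U+0300..U+036F) is treated as combining.
inline bool is_combining( std::uint32_t cp )
{
   return cp >= 0x0300 && cp <= 0x036F;
}

// Return [first, last) covering the cluster that starts at `first`:
// one code point plus any combining marks that directly follow it.
template< typename It >
std::pair< It, It > next_cluster( It first, It end )
{
   It last = first;
   if( last != end )
   {
      ++last;
      while( last != end && is_combining( *last ) )
         ++last;
   }
   return std::make_pair( first, last );
}
```

Calling next_cluster repeatedly, each time starting from the previous
result's second iterator, walks the string one cluster at a time without
normalizing anything.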

Regards,
Reece