From: Jeremy Maitin-Shepard (jbms_at_[hidden])
Date: 2004-04-16 14:05:43


"Reece Dunn" <msclrhd_at_[hidden]> writes:

> Here are my thoughts on Unicode strings, based partially on the current
> discussions of the topic. As I understand it, the problem with strings (standard
> character and Unicode strings) can be broken down into several stages:

> [1] Storage And Representation

> This is how the underlying string is stored (allocation and memory mapping
> policy) and how it is represented (which is governed by locale, but at this
> stage, it is best to know the type: UTF-8, UTF-16, UTF-32, etc.)

I am not sure what you mean by "governed by locale." I believe the
intention is that platform-specific encodings be abandoned, and that
within this library, the internal representation of strings and
characters be one of the Unicode encodings.

> The storage can easily be represented as a container type, and so we have:

> template
> <
> typename CharT,
> template< typename T, class A > class Container = std::vector,
> class AllocT = std::allocator< CharT >
>>
> class string_storage: public Container< CharT, AllocT >
> {
> };

> Here, I have chosen std::vector as the standard storage policy, as this reflects
> the current storage policies; thus, basic_string< CharT, Traits > would
> therefore be based on string_storage< CharT >.

I do like this idea, although I think something like this might be
better:

template <encoding enc, class Container = ...> class unicode_string;

"encoding" would be an enumeration type, such that enc would specify
one of UTF-8, UTF-16, or UTF-32. This is, I would say, a more explicit
way to specify the encoding than relying on the size of the type
specified, and it also avoids problems in cases where the platform does
not have distinct types for UTF-16 and UTF-32 code units (unlikely, I
admit).

The purpose of unicode_string would be to wrap an existing container
with an encoding specifier and a few Unicode-specific operations, such
as appending a Unicode code point, or inserting from another unicode
string, and possibly adding some iterator typedefs.

One issue which would need to be dealt with is that, while direct access
to the underlying container seems necessary for some containers, such as
a rope, publicly inheriting unicode_string from the container type means
that the additions to the interface must be more limited.
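
As a rough sketch of the wrapper described above: every name here
(encoding, code_unit, append, storage) is hypothetical and not an
existing Boost interface. It holds the container as a member and exposes
it through an accessor, which is only one possible answer to the
inheritance question, and it assumes the caller passes valid scalar
values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names throughout; this is not an existing Boost interface.
enum class encoding { utf8, utf16, utf32 };

// Maps each encoding to a code unit type of the right width.
template <encoding Enc> struct code_unit;
template <> struct code_unit<encoding::utf8>  { using type = std::uint8_t; };
template <> struct code_unit<encoding::utf16> { using type = std::uint16_t; };
template <> struct code_unit<encoding::utf32> { using type = std::uint32_t; };

template <encoding Enc,
          class Container = std::vector<typename code_unit<Enc>::type>>
class unicode_string {
public:
    using code_unit_type = typename code_unit<Enc>::type;

    // Append one Unicode scalar value, encoded as Enc requires.
    void append(std::uint32_t cp) {
        // Precondition: in range and not a surrogate.
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (Enc == encoding::utf32) {          // constant per instantiation
            push(cp);
        } else if (Enc == encoding::utf16) {
            if (cp < 0x10000) {
                push(cp);
            } else {                           // surrogate pair
                cp -= 0x10000;
                push(0xD800 + (cp >> 10));
                push(0xDC00 + (cp & 0x3FF));
            }
        } else {                               // UTF-8, 1 to 4 bytes
            if (cp < 0x80) {
                push(cp);
            } else if (cp < 0x800) {
                push(0xC0 | (cp >> 6));
                push(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                push(0xE0 | (cp >> 12));
                push(0x80 | ((cp >> 6) & 0x3F));
                push(0x80 | (cp & 0x3F));
            } else {
                push(0xF0 | (cp >> 18));
                push(0x80 | ((cp >> 12) & 0x3F));
                push(0x80 | ((cp >> 6) & 0x3F));
                push(0x80 | (cp & 0x3F));
            }
        }
    }

    // Direct access to the wrapped container (composition, not inheritance).
    Container& storage() { return data_; }
    const Container& storage() const { return data_; }

private:
    void push(std::uint32_t u) {
        data_.push_back(static_cast<code_unit_type>(u));
    }
    Container data_;
};
```

With this shape, unicode_string<encoding::utf16> s; s.append(0x1F600);
leaves two code units in s.storage(), and a rope or other container
could be substituted as the second template argument provided it offers
push_back.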

> [snip: rope_vector in boost]

> [2] Basic Type And Iteration

> The basic representation is more complex, because now we are dealing with
> character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode
> string). At this stage, we should be concerned only with complete characters,
> not with combining characters and marks.

> The Unicode string should provide at least 3 types of iterator, regardless of
> the internal representation (NOTE: as such, they will be implementation
> dependent on how the string is represented):
> * UTF-8 -- provides access to the UTF-8 representation of the string;
> * UTF-16 -- provides access to the UTF-16 representation of the string;
> * UTF-32 -- provides access to the Unicode character type.

This seems reasonable, although in practice the UTF-32/code-point
iterator would be the most likely to be used.
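
To illustrate what a code-point view over UTF-8 storage involves, here
is a minimal decoding sketch. The function names are made up for this
example, and it assumes the input is already well-formed UTF-8; a real
library would have to validate and report errors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes the UTF-8 sequence starting at s[i] and advances i past it.
// Assumes well-formed input (no validation is performed).
inline std::uint32_t next_code_point(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    const unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;                       // single-byte (ASCII)
    // The lead byte tells us how many continuation bytes follow.
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    std::uint32_t cp = b0 & (0x3F >> extra);        // payload bits of lead byte
    while (extra-- > 0) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Collects the code points of a whole UTF-8 string.
inline std::vector<std::uint32_t> code_points(const std::string& utf8) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < utf8.size();) {
        out.push_back(next_code_point(utf8, i));
    }
    return out;
}
```

A proper iterator would wrap next_code_point behind operator* and
operator++, but the boundary handling is the same.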

> [snip]

> As a side note, it should be feasible to provide specialist wrappers around
> existing Unicode libraries (like Win32 (CharNext, etc.), ICU and libiconv?), so
> I would suggest having something akin to char_traits in basic_string.

I would say there is not much need to provide "specialist wrappers"
over other libraries. Presumably, a lot of a Boost Unicode library
could use code from ICU, but there is no advantage in attempting to use
platform-specific facilities, and doing so would surely introduce
inefficiency and great complication.

> RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are
> multi-character encodings of UTF-32 (not considering combining marks at this
> stage), whereas UTF-32 is a single character encoding.

Yes, most processing would at the very least need to internally use
this code-point iterator.

There is a significant advantage, however, in also standardizing on a
single POD-array representation: much of the library would then not need
to be templated (templated code is implemented largely in header files,
and thus compiled at each use). Less efficient functions, which allow
arbitrary iterators through runtime polymorphism, could also be
provided. I think it will be particularly important to examine this
trade-off, because Unicode support involves a large number of
heavy-weight facilities.

> [3] Algorithms, Locales, etc.

> These are built upon the UTF-32 view of the Unicode string, like the string
> algorithms in the Boost library. Therefore, instead of str.find( unicode_string(
> "World" )), you would have find( str, unicode_string( "World" )).

Well, except that you would want the strength, etc. to be adjustable,
and of course localized, and string literals pose additional problems...

> I would also suggest that there be another iterator that operates on std::pair<
> unicode_string::iterator, unicode_string::iterator > to group combining marks,
> etc.

These code point groups are referred to as grapheme clusters, and I
certainly agree that it is necessary to provide an iterator interface
to grapheme clusters. I would not suggest, however, that normalization
be integrated into that interface, because only a small portion of the
possible grapheme clusters can be normalized into a single code point,
and I don't think it is a particularly common operation to do so,
especially for only a single grapheme cluster.
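
For concreteness, a grapheme cluster interface returning iterator pairs
over a code point sequence could be sketched as below. This is a
deliberate simplification: it treats only the Combining Diacritical
Marks block (U+0300 to U+036F) as extending a cluster, whereas real
segmentation needs the full Unicode property data. The function names
are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Simplifying assumption: only U+0300..U+036F count as combining marks.
inline bool is_combining_mark(std::uint32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

using cp_iter = std::vector<std::uint32_t>::const_iterator;

// Returns the half-open range [first, end) of the cluster starting at first:
// one base code point plus any combining marks that follow it.
inline std::pair<cp_iter, cp_iter> next_cluster(cp_iter first, cp_iter last) {
    assert(first != last);
    cp_iter it = first;
    ++it;
    while (it != last && is_combining_mark(*it)) ++it;
    return std::make_pair(first, it);
}

// Counts clusters by stepping from one cluster end to the next.
inline std::size_t count_clusters(const std::vector<std::uint32_t>& s) {
    std::size_t n = 0;
    for (cp_iter it = s.begin(); it != s.end(); ++n) {
        it = next_cluster(it, s.end()).second;
    }
    return n;
}
```

Note that nothing here normalizes: "e" followed by U+0301 is reported as
one cluster spanning two code points, and it is left to the caller to
decide whether to compose it.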

-- 
Jeremy Maitin-Shepard
