From: John Maddock (john_at_[hidden])
Date: 2004-04-13 09:56:47


> It seems that Unicode support in Boost (which could lead to Unicode
> support in the C++ language and standard library) would be quite
> desirable.

You bet, I would love to add much improved Unicode support to Boost.Regex
(the issue is being raised by users more and more often), but need a
"standard" library on which to base it. I was planning to raise this issue
myself later in the summer, but if you want to take it on that's one less
thing to worry about :-)

> The IBM International Components for Unicode (ICU) library
> (http://oss.software.ibm.com/icu/) is an existing C++ library with what
> appears to be a Boost-compatible license, which provides all or most of
> the Unicode support that would be desired in Boost or the C++ standard
> library, in addition to Unicode-equivalents of libraries already either
> in the standard library or in Boost, including number/currency
> formatting, date formatting, message formatting, and a regular
> expression library. Unfortunately, it does not use C++ exceptions to
> signal exceptional conditions (but rather it uses an error code return
> mechanism), it does not follow Boost naming conventions, and although
> there are some C++-specific facilities, most of the C++ API is the same
> as the C API, thus resulting in a less-than-optimal C++ interface.

Agreed.

> Nonetheless, I think Boostifying the ICU library would be quite
> feasible, whereas attempting to reimplement all of the desired
> functionality that the ICU library provides would be extremely
> time consuming, since the collating and other services in the ICU
> library already support a large number of locales, and the
> character-set conversion facilities support a large number of character
> sets.
>
> The representation of locales does present an issue that needs to be
> considered. The existing C++ standard locale facets are not very
> suitable for a variety of reasons:
>
> - The standard facets (and the locale class itself, in that it is a
> functor for comparing basic_strings) are tied to facilities such as
> std::basic_string and std::ios_base which are not suitable for
> Unicode support.

Why not? Once the locale facets are provided, the std iostreams will "just
work"; that was the whole point of templating them in the first place.

> - The interface of std::collate<Ch> is not at all suitable for
> providing all of the functionality desired for Unicode string
> collation.

There may be problems with other facets, but not with this one IMO: Unicode
provides a well-defined algorithm for creating a sort key from a Unicode
string, and that's exactly the facility that std::collate needs (for
transform; the other methods can then be implemented in terms of
that).
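
To make that concrete, here is a minimal sketch of a collate facet whose
compare is written purely in terms of transform. The class name
toy_key_collate and the helper toy_compare are made up for illustration, and
the "sort key" is just ASCII lower-casing, a stand-in for a real Unicode
Collation Algorithm key, not an implementation of it.

```cpp
#include <cctype>
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical facet: the sort key is only an ASCII-lower-cased copy of the
// input, standing in for a real Unicode sort key.
class toy_key_collate : public std::collate<char>
{
public:
    explicit toy_key_collate(std::size_t refs = 0) : std::collate<char>(refs) {}

protected:
    // Build the sort key for [lo, hi).
    std::string do_transform(const char* lo, const char* hi) const override
    {
        std::string key;
        for (const char* p = lo; p != hi; ++p)
            key += static_cast<char>(std::tolower(static_cast<unsigned char>(*p)));
        return key;
    }

    // Compare two ranges by comparing their sort keys.
    int do_compare(const char* lo1, const char* hi1,
                   const char* lo2, const char* hi2) const override
    {
        const std::string k1 = do_transform(lo1, hi1);
        const std::string k2 = do_transform(lo2, hi2);
        const int r = k1.compare(k2);
        return r < 0 ? -1 : (r > 0 ? 1 : 0);
    }
};

// Hypothetical helper: compare two strings through the facet's public interface.
int toy_compare(const std::string& a, const std::string& b)
{
    const std::locale loc(std::locale::classic(), new toy_key_collate);
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}
```

With this toy key, toy_compare("Apple", "apple") reports the two strings as
equal, which a plain code-unit comparison would not.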

> A suitable Unicode collation facility should at least
> allow for user-selection of the strength level used (refer to
> http://www.unicode.org/unicode/reports/tr10/), and would ideally
> also support customizations as extensive as the ICU library does
> (refer to
> http://oss.software.ibm.com/icu/userguide/Collate_ServiceArchitecture.html
> and
> http://oss.software.ibm.com/icu/userguide/Collate_Customization.html).

Complicated stuff!

Normally the features that you are describing would be handled by a named
collate facet:

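#include <locale>

template<class charT>
class unicode_collate_byname : public std::collate<charT>
{
public:
    // Build the collation facet for the named locale, e.g. "en_GB".
    explicit unicode_collate_byname(const char* locale_name);
    /* details */
};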

When the user imbues their locale with a unicode_collate_byname("en_GB"),
then they would expect it to "do the right thing". Of course there may be a
lower-level interface below this, but I see no problem with implementing
this facet.
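
As a rough sketch of how such a named facet would be used: the facet below is
called demo_collate_byname and is entirely hypothetical. Its only "locale
data" is a rule that the name "C" means code-unit order and any other name
means case-insensitive ASCII order, which is nothing like real en_GB
tailoring. The point is only that once the facet is installed, std::locale
itself can serve as the comparison functor.

```cpp
#include <algorithm>
#include <cstddef>
#include <locale>
#include <string>
#include <vector>

// Hypothetical named facet; the name only selects between two toy rules.
template <class charT>
class demo_collate_byname : public std::collate<charT>
{
public:
    typedef std::basic_string<charT> string_type;

    explicit demo_collate_byname(const char* locale_name, std::size_t refs = 0)
        : std::collate<charT>(refs), name_(locale_name) {}

protected:
    int do_compare(const charT* lo1, const charT* hi1,
                   const charT* lo2, const charT* hi2) const override
    {
        if (name_ == "C") // fall back to the base facet's ordering
            return std::collate<charT>::do_compare(lo1, hi1, lo2, hi2);
        const string_type a = fold(lo1, hi1);
        const string_type b = fold(lo2, hi2);
        const int r = a.compare(b);
        return r < 0 ? -1 : (r > 0 ? 1 : 0);
    }

private:
    // Lower-case a range using the classic locale's ctype facet.
    static string_type fold(const charT* lo, const charT* hi)
    {
        string_type s(lo, hi);
        if (!s.empty())
            std::use_facet<std::ctype<charT> >(std::locale::classic())
                .tolower(&s[0], &s[0] + s.size());
        return s;
    }

    std::string name_;
};

// Install the facet into a locale and sort with the locale as the comparator.
std::vector<std::string> sort_with(const char* locale_name,
                                   std::vector<std::string> words)
{
    const std::locale loc(std::locale::classic(),
                          new demo_collate_byname<char>(locale_name));
    std::sort(words.begin(), words.end(), loc); // uses locale::operator()
    return words;
}
```

Sorting "banana", "apple", "Cherry" with the name "en_GB" keeps "Cherry" last,
whereas the name "C" moves it to the front.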

> - Facilities such as Unicode string collation are heavily data-driven,
> and it would be inefficient to load the data for facilities that are
> not used. This could be avoided by using some sort of lazy loading
> mechanism.

Yep.

> It would still be possible to use the standard locale object as a
> container of an entirely new set of facets, which could be loaded from
> the data sources based on the name of the locale, and ``injected'' into
> an existing locale object, by calling some function. It is not clear,
> however, what advantage this would serve over simply using a
> thin-wrapper over a locale name to represent a ``locale,'' as is done in
> the ICU library.

That would be a really bad idea: no code would take advantage of those
facets. The big advantage of implementing the std ones is that it gets
iostreams working with Unicode data types, and that would then get
lexical_cast and a whole load of other things working too...
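
A small illustration of existing code picking up a standard facet without
being changed: the sketch below replaces std::numpunct<wchar_t> in a stream's
locale, and the ordinary operator<< for integers then uses it. The names
grouping_numpunct and format_with_facet are invented for this example, and
wchar_t merely stands in for a Unicode character type.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: group digits in threes, separated by commas.
class grouping_numpunct : public std::numpunct<wchar_t>
{
protected:
    wchar_t do_thousands_sep() const override { return L','; }
    std::string do_grouping() const override { return "\3"; }
};

// Format a number through an ordinary wide stream imbued with the facet.
std::wstring format_with_facet(long value)
{
    std::wostringstream out;
    out.imbue(std::locale(std::locale::classic(), new grouping_numpunct));
    out << value; // the stream's num_put consults our numpunct
    return out.str();
}
```

The formatting code itself never mentions the facet; it only sees the stream's
locale, which is the property that would carry over to lexical_cast and
similar stream-based utilities.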

However I think we're getting ahead of ourselves here: I think a Unicode
library should be handled in stages:

1) define the data types for 8/16/32 bit Unicode characters.
2) define iterator adapters to convert a sequence of one Unicode character
type to another.
3) define char_traits specialisations (as necessary) in order to get
basic_string working with Unicode character sequences, typedef the
appropriate string types:

typedef basic_string<utf8_t> utf8_string; // etc

4) define low level access to the core Unicode data properties (in
UnicodeData.txt).
5) Begin to add locale support - a big job, probably a few facets at a time.
6) define iterator adapters for various Unicode algorithms
(composition/decomposition/compression etc).
7) Anything I've forgotten :-)
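
To give a flavour of stage 2, here is a minimal sketch of an iterator adapter
that presents a UTF-8 sequence as UTF-32 code points. The names
utf8_to_utf32_iterator and to_utf32 are hypothetical, char32_t is used as the
32-bit character type, and the adapter assumes well-formed UTF-8: it does no
validation at all, which a real library would have to add.

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical input-iterator adapter over UTF-8 code units.
// Assumption: the underlying sequence is well-formed UTF-8.
template <class BaseIterator>
class utf8_to_utf32_iterator
{
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char32_t value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char32_t* pointer;
    typedef char32_t reference;

    explicit utf8_to_utf32_iterator(BaseIterator pos) : pos_(pos) {}

    // Decode the code point that starts at the current position.
    char32_t operator*() const
    {
        BaseIterator p = pos_;
        const unsigned char lead = static_cast<unsigned char>(*p);
        const int extra = trailing_bytes(lead);
        // Strip the length-marker bits from the lead byte.
        char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (int i = 0; i < extra; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(*++p) & 0x3F);
        return cp;
    }

    // Skip over every code unit of the current code point.
    utf8_to_utf32_iterator& operator++()
    {
        std::advance(pos_, trailing_bytes(static_cast<unsigned char>(*pos_)) + 1);
        return *this;
    }

    utf8_to_utf32_iterator operator++(int)
    {
        utf8_to_utf32_iterator tmp(*this);
        ++*this;
        return tmp;
    }

    friend bool operator==(const utf8_to_utf32_iterator& a,
                           const utf8_to_utf32_iterator& b)
    { return a.pos_ == b.pos_; }

    friend bool operator!=(const utf8_to_utf32_iterator& a,
                           const utf8_to_utf32_iterator& b)
    { return !(a == b); }

private:
    // Number of continuation bytes implied by a lead byte.
    static int trailing_bytes(unsigned char lead)
    {
        if (lead < 0x80) return 0;
        if ((lead & 0xE0) == 0xC0) return 1;
        if ((lead & 0xF0) == 0xE0) return 2;
        return 3; // assumed to be a 4-byte lead
    }

    BaseIterator pos_;
};

// Convenience wrapper: convert a whole UTF-8 string through the adapter.
std::u32string to_utf32(const std::string& utf8)
{
    typedef utf8_to_utf32_iterator<std::string::const_iterator> iter;
    return std::u32string(iter(utf8.begin()), iter(utf8.end()));
}
```

Because the adapter is an ordinary iterator, it plugs straight into the
basic_string range constructor and the standard algorithms, which is the
appeal of doing the conversions this way.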

The main goal would be to define a good clean interface, the implementation
could be:

1) On top of ICU.
2) On top of platform-specific APIs (Windows and I believe MacOS X have
some Unicode support without the need to resort to ICU or whatever).
3) An independent Boost implementation (difficult once you get into the
locale specific stuff).

Anyway, I hope these thoughts help,

John.

