From: Vladimir Prus (ghost_at_[hidden])
Date: 2004-04-13 02:47:39


Hi Jeremy,

> The IBM International Components for Unicode (ICU) library
> (http://oss.software.ibm.com/icu/) is an existing C++ library ....
> and although
> there are some C++-specific facilities, most of the C++ API is the same
> as the C API, thus resulting in a less-than-optimal C++ interface.

True. In particular, it looks like they use iterators that have a 'next'
method. Hmm... let me guess why -- IIRC it was initially a Java library and
was then ported to C++.

> Nonetheless, I think Boostifying the ICU library would be quite
> feasible, ...

As Miro said, there are several ways Boost.Unicode might relate to ICU,
though reusing code from there is desirable.

> The representation of locales does present an issue that needs to be
> considered. The existing C++ standard locale facets are not very
> suitable for a variety of reasons:
>
> - The standard facets (and the locale class itself, in that it is a
> functor for comparing basic_strings) are tied to facilities such as
> std::basic_string and std::ios_base which are not suitable for
> Unicode support.

We can just forget about locale::operator() ;-) But there are other issues.
For example, 'toupper' takes a charT and returns a charT. The Unicode
standard (in 5.18) gives an example of a character which becomes two
characters when uppercased. Also, it might be necessary to look at the
following code point to find out whether it is a combining character.
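
To make the 'toupper' problem concrete, here is a minimal sketch, assuming
C++11's char32_t and std::u32string for code points. The names
to_upper_full and followed_by_combining_mark are hypothetical, not part of
ICU or the standard library, and the mapping covers only ASCII letters plus
the German sharp s (U+00DF), which uppercases to "SS". The second helper
only checks the Combining Diacritical Marks block, not every combining
character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: string-to-string uppercase mapping for a tiny,
// illustrative subset of code points. A charT -> charT function cannot
// express the first branch, because one code point becomes two.
std::u32string to_upper_full(const std::u32string& in)
{
    std::u32string out;
    for (char32_t c : in) {
        if (c == 0x00DF) {                      // LATIN SMALL LETTER SHARP S
            out += U"SS";                       // expands to two code points
        } else if (c >= U'a' && c <= U'z') {
            out += static_cast<char32_t>(c - U'a' + U'A');
        } else {
            out += c;                           // left unchanged in this sketch
        }
    }
    return out;
}

// Hypothetical helper: looks one code point ahead. Only the Combining
// Diacritical Marks block (U+0300..U+036F) is recognised here.
bool followed_by_combining_mark(const std::u32string& s, std::size_t i)
{
    return i + 1 < s.size() && s[i + 1] >= 0x0300 && s[i + 1] <= 0x036F;
}
```

The point of the sketch is only the signature: the mapping has to work on
sequences, so the output can be longer than the input.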

Other facets, say 'num_put', may not need changes. If it generates data
in UCS-2, that's fine.

> - The interface of std::collate<Ch> is not at all suitable for
> providing all of the functionality desired for Unicode string
> collation. A suitable Unicode collation facility should at least
> allow for user-selection of the strength level used (refer to
> http://www.unicode.org/unicode/reports/tr10/),

Can't you 'imbue' a new facet whenever you need to change something?
We do need to decide, though, what to use for 'charT' and what encoding to
use. If ICU can compare UTF-16 encoded strings, then it's possible to pass
those strings to 'compare'. I don't understand what 'transform' is for,
though.
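
As a rough illustration of replacing the collation facet, the sketch below
derives from std::collate<char> and installs the result in a copy of the
classic locale. The facet ci_collate and the helpers compare_with and
sort_key are hypothetical names; the facet only folds ASCII case, which is
a stand-in for a real strength setting and has nothing to do with how ICU
implements collation. In the standard interface, 'transform' returns a
sort key: comparing two keys lexicographically is required to give the
same ordering as calling 'compare' on the original strings.

```cpp
#include <locale>
#include <string>

// Hypothetical facet: orders ASCII strings ignoring case.
class ci_collate : public std::collate<char>
{
    static unsigned char fold(char c)
    {
        if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
        return static_cast<unsigned char>(c);
    }

protected:
    int do_compare(const char* lo1, const char* hi1,
                   const char* lo2, const char* hi2) const override
    {
        for (; lo1 != hi1 && lo2 != hi2; ++lo1, ++lo2) {
            unsigned char a = fold(*lo1), b = fold(*lo2);
            if (a != b) return a < b ? -1 : 1;
        }
        if (lo1 != hi1) return 1;    // first string is longer
        if (lo2 != hi2) return -1;   // second string is longer
        return 0;
    }

    // The sort key is the case-folded string, so plain comparison of keys
    // agrees with do_compare above.
    std::string do_transform(const char* lo, const char* hi) const override
    {
        std::string key;
        for (; lo != hi; ++lo) key += static_cast<char>(fold(*lo));
        return key;
    }
};

// Convenience wrappers around the facet installed in 'loc'.
int compare_with(const std::locale& loc,
                 const std::string& a, const std::string& b)
{
    const std::collate<char>& c = std::use_facet<std::collate<char>>(loc);
    return c.compare(a.data(), a.data() + a.size(),
                     b.data(), b.data() + b.size());
}

std::string sort_key(const std::locale& loc, const std::string& s)
{
    return std::use_facet<std::collate<char>>(loc)
        .transform(s.data(), s.data() + s.size());
}
```

A locale built as std::locale(std::locale::classic(), new ci_collate) then
uses the new ordering wherever std::collate<char> is looked up, while all
other facets stay as they were. Sort keys pay off when the same strings are
compared many times, e.g. while sorting a large list.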

> It would still be possible to use the standard locale object as a
> container of an entirely new set of facets, which could be loaded from
> the data sources based on the name of the locale, and ``injected'' into
> an existing locale object, by calling some function. It is not clear,
> however, what advantage this would serve over simply using a
> thin-wrapper over a locale name to represent a ``locale,'' as is done in
> the ICU library.

First, using std::locale would be more familiar. Second, std::locale allows
different facets to be combined, and that's a good thing in general. E.g. I
have all POSIX locale categories set to "C" except for LC_CTYPE. It would be
inconvenient to have only one locale setting for everything.
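
A small sketch of that mix-and-match property, using only the standard
library: one facet is replaced in a copy of the classic locale and
everything else is left alone. The facet grouped_numpunct and the helper
format_with are hypothetical names chosen for this example. For named
locales, the category-based constructor
std::locale(base, other, std::locale::ctype) does the same per category,
but it depends on which locales are installed, so it is not used here.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: group digits in threes, separated by ','.
class grouped_numpunct : public std::numpunct<char>
{
protected:
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats 'value' through a stream that uses 'loc'.
std::string format_with(const std::locale& loc, long value)
{
    std::ostringstream os;
    os.imbue(loc);
    os << value;
    return os.str();
}
```

With std::locale mixed(std::locale::classic(), new grouped_numpunct), number
formatting picks up the new punctuation while character classification
still comes from the classic locale.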

- Volodya

