
From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2005-07-25 05:15:43


Dear Graham,

You seem to have a lot of valuable experience with Unicode.
I feel that the interface is the more urgent matter for now. The
implementation of the tables can always be changed as long as the
interface remains stable. Let me rephrase that: I think the issues of
programmer's interface and data tables are orthogonal, and programmers
will care most about the interface, and then about performance. This
is not to say you don't raise a whole lot of important and difficult
issues that will have to be solved at some point.
As far as I can see, C++ provides the machinery to abstract away from
code points, which gives an opportunity for hiding more of the
complexity of Unicode than other programming languages. I would love
to see a library which hides the issue of different normalisation
forms from the programmer.

As for the scope of the library:

> How have you hooked in dictionary word break support for languages like
> Thai
IMO that would be beyond the scope of a general Unicode library.

> How far have you gone? Do you have support for going from logical to
> display on combined ltor and rtol ? Customised glyph conversion for
> Indic Urdu?
Correct me if I'm wrong, but I think these issues become important
only when rendering Unicode strings. Aren't they thus better handled
by Uniscribe, Pango, or similar OS-specific libraries? I think a Boost
Unicode library should focus on processing symbolic Unicode strings
and keep away from what happens when they are displayed, just like
std::basic_string does.

> How can we ensure that other boost projects understand the implication
> of Unicode support and the subtle changes required, e.g. hooks to allow
> for canonical decomposition on string data portions of regular
> expressions in the regexpr project?

As long as the Unicode string abstracts away from normalisation forms,
level 2 Unicode support for regular expressions should basically come
for free, I believe. In general, the Unicode library should
incorporate as much Unicode-specific machinery as possible, leaving as
little work as possible for other library authors. Documentation is
important here.
Why would you want to do canonical decomposition explicitly in a
regular expression?

> We will need to create a utility to take the 'raw'/ published unicode
> data files along with user defined private characters to make these
> tables which would then be used by the set of functions that we will
> agree such as isnumeric, ishangul, isstrongrtol, isrtol etc.

I find the idea of users embedding private character properties
_within_ the standard Unicode tables, and building their own slightly
different version of the Unicode library, scary. Why is this needed?

Regards,
Rogier


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk