From: Erik Wien (wien_at_[hidden])
Date: 2005-03-20 19:50:04


Rogier van Dalen wrote:
> <snip>
> I believe we are talking about different kinds of users. Let's get
> this clear: I was assuming that the Unicode library will be aimed at
> programmers doing everyday programming jobs whose programs will have
> to deal with non-English characters (because they're bound to be
> localised, or because non-English names will be inserted in a
> database, or whatever), i.e. people who have no idea about how Unicode
> works and don't want to, as long as it does work.

That was my initial thought. This Unicode library should, in my opinion,
make handling Unicode strings correctly as easy as handling ASCII
strings is today. But that does not mean we have to put mittens on
everyone else to keep them away from the lower-level details. If you need
to manipulate code points, I think you should be allowed to. Code units,
on the other hand, I'm a little more wary about, since users could easily
screw things up at that level (make a sequence ill-formed, as the sketch
below shows). Furthermore, I don't really see why anyone would need to
muck about with code units.
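To make the ill-formedness point concrete, here is a minimal sketch
(plain std::basic_string<char16_t>, no proposed library types) of how
code-unit-level mutation can silently corrupt a string:

#include <string>

int main()
{
    // U+1D11E (MUSICAL SYMBOL G CLEF) encodes in UTF-16 as the
    // surrogate pair 0xD834 0xDD1E.
    std::basic_string<char16_t> s;
    s.push_back(0xD834);
    s.push_back(0xDD1E);

    // Truncating between the two halves of the pair leaves a lone
    // high surrogate behind: the sequence is now ill-formed UTF-16,
    // and nothing at this level stops the user from doing it.
    s.resize(1);

    return 0;
}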

> What I think would be a good interface:
>
> // A string of code points, encoded UTF-16 (or templated).
> class code_point_string {
> public:
>     //...
>     const std::basic_string<char16_t> & code_units() const;
> };
>
> // A string of "grapheme clusters", with a code_point_string underlying.
> // The string is always in a normalisation form.
> template <class NormalisationPolicy = NormalisationFormC>
> class unicode_string
> {
> public:
>     //...
>     const code_point_string & code_points() const;
> };
>
> Those who need to process code points can happily use
> code_point_string; others can use unicode_string.

This is starting to look more and more like the way to go, in my opinion.
By layering interfaces with increasing levels of abstraction (from code
points on up), we could more or less keep everyone happy. A usage sketch
follows below.
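Here is that hedged usage sketch, with your proposed types stubbed out
just enough to compile (nothing here is a real implementation):

#include <string>

class code_point_string
{
public:
    const std::basic_string<char16_t> & code_units() const { return units_; }
private:
    std::basic_string<char16_t> units_;
};

struct NormalisationFormC {};

template <class NormalisationPolicy = NormalisationFormC>
class unicode_string
{
public:
    const code_point_string & code_points() const { return points_; }
private:
    code_point_string points_;
};

int main()
{
    unicode_string<> name;                   // everyday users stay up here

    // Code point processing drops down one layer:
    const code_point_string & cps = name.code_points();

    // Talking to, say, a UTF-16 OS API drops down to raw code units:
    const char16_t * raw = cps.code_units().c_str();
    (void) raw;

    return 0;
}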

What I really don't like about this solution is that we would end up
with a myriad of different types that are all "unicode strings", but at
different levels. I can easily imagine mayhem erupting when everyone gets
their favorite unicode abstraction and uses that one exclusively in their
APIs. Passing strings around would be a complete nightmare.

One solution could be to make code points the "base level" of
abstraction and use normalization policies (like you outlined) on the
functions where normalization form actually matters (find etc.). That
way we could still get most of the functionality a
grapheme_cluster_string would provide, but without the extra types.
Roughly what I have in mind is sketched below.
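A hedged sketch of that alternative (all names hypothetical, and
normalization stubbed as the identity; a real NormalisationFormC would
actually normalize to NFC):

#include <cstddef>
#include <string>

struct NormalisationFormC
{
    // Identity stub; a real implementation would normalize to NFC.
    static std::basic_string<char16_t>
    normalize(const std::basic_string<char16_t> & s) { return s; }
};

class code_point_string
{
public:
    explicit code_point_string(const std::basic_string<char16_t> & units)
        : units_(units) {}

    // Normalization only matters for comparison-like operations, so the
    // policy parameterizes those functions instead of the string type.
    template <class NormalisationPolicy>
    std::size_t find(const code_point_string & pattern) const
    {
        return NormalisationPolicy::normalize(units_)
                   .find(NormalisationPolicy::normalize(pattern.units_));
    }

private:
    std::basic_string<char16_t> units_;
};

int main()
{
    code_point_string text(u"caf\u00E9");
    code_point_string pattern(u"\u00E9");

    // With a real NFC policy, this would also match a decomposed
    // "e" followed by U+0301 (COMBINING ACUTE ACCENT).
    std::size_t pos = text.find<NormalisationFormC>(pattern);
    (void) pos;

    return 0;
}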

I'm just afraid that if we have a code_point_string in every encoding,
plus the dynamic one, in addition to the same number of strings at the
grapheme cluster level, there would simply be too many of them, and that
would confuse users more than it would help them.

Feel free to convince me otherwise though.

