From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2005-03-21 06:10:23
[Rearranging paragraphs from your post]
On Mon, 21 Mar 2005 01:50:04 +0100, Erik Wien <wien_at_[hidden]> wrote:
> One solution could be to make code points the "base level" of
> abstraction, and used normalization policies (like you outlined) for
> functions where normalization form actually matters (find etc.), we
> could still get most of the functionality a grapheme_cluster_string
> would provide, but without the extra types.
I'm not too sure how you envision using normalisation policies for functions.
The problem I see with it is that a normalisation form is not a
property of a function; it is a property of a string, and I think it
should be an invariant of that string.
Imagine a std::map<> where you use a Unicode string as a key; you want
equivalent strings to map to the same object. operator< for two
strings with the same normalisation form and the same encoding is
trivial (and as fast as std::basic_string::operator< for UTF-8 or
UTF-32). On two strings with unknown normalisation forms, it will be
dreadfully slow, because you'll need to look things up in the
Unicode database all the time.
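A small illustration of the pitfall: two canonically equivalent UTF-8 spellings of "café" (NFC with U+00E9, NFD with U+0065 U+0301) land in different entries of a `std::map` keyed on raw byte strings, because bytewise `operator<` knows nothing of canonical equivalence. The `demo` function name is made up for the example:

```cpp
#include <map>
#include <string>

// "café" in UTF-8: NFC encodes the last letter as U+00E9 (bytes C3 A9),
// NFD as U+0065 plus combining acute U+0301 (bytes 65 CC 81).
std::map<std::string, int> demo() {
    std::map<std::string, int> m;
    m["caf\xC3\xA9"] = 1;   // NFC spelling
    m["cafe\xCC\x81"] = 2;  // NFD spelling, canonically equivalent
    // Bytewise comparison treats these as distinct keys, so the two
    // equivalent spellings end up as *two* entries in the map.
    return m;
}
```

A key type that guarantees a single normalisation form avoids this while keeping the same cheap bytewise `operator<`.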
> What I really don't like about this solution, is that we would end up
> with a myriad of different types that all are "unicode strings", but at
> different levels. I can easily imagine mayhem erupting when everyone gets
> their favorite unicode abstraction and uses that one exclusively in their
> APIs. Passing strings around would be a complete nightmare.
> I'm just afraid that if we have a code_point_string in all encodings,
> plus the dynamic one, in addition to the same number of strings at the
> grapheme cluster level, there would simply be too many of them, and it
> would confuse the users more than it would help them.
As long as there is one boost::unicode_string, I speculate this
shouldn't be much of a problem. Developers wanting to make a different
choice from yours will, I think, fall into one of two groups:
- those who know about Unicode and are not easily confused by
encodings and normalisation forms;
- and those who worry about performance.
With a good rationale (based on measured performance in a number of
test cases), you should be able to pick one that's good enough in most
situations, I think. (Looking at the ICU website, I'd say this would
involve UTF-16, but let's see what you come up with.)
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk