From: Erik Wien (wien_at_[hidden])
Date: 2005-04-04 05:51:09
Sorry about the late reply. I have been away for Easter and, to top it
all off, have been sick for a while. Anyway, I'm back...
> I'm not too sure how you envision using normalisation policies for functions.
> However, the problem I see with it is that normalisation form is not a
> property of a function. A normalisation form is a property of a
> string. I think it should be an invariant of that string.
> Imagine a std::map<> where you use a Unicode string as a key; you want
> equivalent strings to map to the same object. operator< for two
> strings with the same normalisation form and the same encoding is
> trivial (and as fast as std::basic_string::operator< for UTF-8 or
> UTF-32). On two strings with unknown normalisation forms, it will be
> dreadfully much slower because you'll need to look things up in the
> Unicode database all the time.
Yep. You are of course right. I should start thinking before I talk. :)
Having strings locked to a normalization form would be the most logical
way to go. What I don't really see, though, is why you would need
a separate class (different from the code point string class, that is)
for this functionality. If we made the code point string classes (both
the static and dynamic ones) have a normalization policy and provide a
policy that doesn't actually do anything, in addition to ones that
normalize to each of the normalization forms, everyone could have their
way. If you don't care about normalization, use the do_nothing one. If
you do care (or simply have no clue what normalization is - most users),
use NFD or NFC or something.
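
A minimal sketch of the policy idea above, under loudly stated
assumptions: `no_normalization`, `toy_nfc` and `count_distinct_keys`
are hypothetical names invented for illustration, and `toy_nfc` is not
real NFC. It composes only the single pair U+0065 U+0301 into U+00E9,
whereas a real policy would consult the Unicode database. The point is
only that a string locked to a normalization form can use a plain
code-point comparison as its operator<, so equivalent strings map to
the same std::map key.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical policy: leaves the code points untouched.
struct no_normalization {
    static std::u32string apply(const std::u32string& s) { return s; }
};

// Hypothetical toy policy standing in for NFC. It composes ONLY
// U+0065 U+0301 into U+00E9; real NFC needs the Unicode database.
struct toy_nfc {
    static std::u32string apply(const std::u32string& s) {
        const char32_t combining_acute = 0x0301;
        const char32_t e_acute = 0x00E9;
        std::u32string out;
        for (std::size_t i = 0; i < s.size(); ++i) {
            if (s[i] == U'e' && i + 1 < s.size() && s[i + 1] == combining_acute) {
                out.push_back(e_acute);
                ++i;  // skip the combining accent that was just composed
            } else {
                out.push_back(s[i]);
            }
        }
        return out;
    }
};

// Code point string whose normalization form is an invariant,
// established once in the constructor by the policy.
template <class NormalizationPolicy>
class code_point_string {
public:
    explicit code_point_string(const std::u32string& s)
        : data_(NormalizationPolicy::apply(s)) {}

    const std::u32string& code_points() const { return data_; }

    // Plain lexicographic comparison; no database lookups needed here.
    friend bool operator<(const code_point_string& a, const code_point_string& b) {
        return a.data_ < b.data_;
    }

private:
    std::u32string data_;
};

// One default name for users who do not care about the details.
typedef code_point_string<toy_nfc> unicode_string;

// Inserts two canonically equivalent spellings of "café" as map keys
// and reports how many distinct keys the map ends up with.
template <class String>
std::size_t count_distinct_keys() {
    std::map<String, int> m;
    m[String(U"caf\u00E9")] = 1;                                // precomposed
    m[String(std::u32string(U"cafe") + char32_t(0x0301))] = 2;  // decomposed
    return m.size();
}

// Usage: the spellings collapse to one key only under a normalizing policy.
inline void usage_example() {
    assert(count_distinct_keys<unicode_string>() == 1);
    assert(count_distinct_keys<code_point_string<no_normalization> >() == 2);
}
```

With the do-nothing policy the two spellings stay distinct keys, which
is the trade-off a user accepts by opting out of normalization.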
>>What I really don't like about this solution, is that we would end up
>>with a myriad of different types that all are "unicode strings", but at
>>different levels. I can easily imagine mayhem erupting when everyone gets
>>their favorite unicode abstraction and uses that one exclusively in their
>>APIs. Passing strings around would be a complete nightmare.
>>I'm just afraid that if we have a code_point_string in all encodings,
>>plus the dynamic one, in addition to the same number of strings at the
>>grapheme cluster level, there would simply be too many of them, and it
>>would confuse the users more than it would help them.
> As long as there is one boost::unicode_string, I speculate this
> shouldn't be much of a problem.
I hope you are right, because if it turns out to be a problem, it will
be a major one! What do the rest of you think? Would a large number of
different classes lead to confusion, or would a unicode_string typedef
hide this complexity?
> Developers wanting to make another
> choice than you have made I think will fall into either of two
> categories:
> - Those who know about Unicode and are not easily confused by
> encodings and normalisation forms;
> - and those who worry about performance.
Yep, that sounds about right. Most users should not really care what
kind of encoding and normalization form is used. They want to work with
the string, not fiddle with its internal representation.
> With a good rationale (based
> on measured performance in a number of test cases), you should be able
> to pick one that's good enough in most situations, I think. (Looking
> at the ICU website, I'd say this would involve UTF-16, but let's see
> what you come up with.)
I would be surprised if any encoding other than UTF-16 ended up as
the most efficient one. UTF-8 suffers from the large variation in the
number of code units per code point, and UTF-32 wastes space while
buying most users little performance. You never know, though.
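
To put numbers on the size argument, here is a small sketch that counts
the code units each encoding form needs for one code point. The
function names are hypothetical, and the code assumes its input is a
valid Unicode scalar value (not a surrogate). It says nothing about
measured speed, only about storage and the variation in length.

```cpp
#include <cassert>
#include <cstddef>

// Code units needed to encode one Unicode scalar value in UTF-8 (1 to 4).
inline std::size_t utf8_units(char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Code units needed in UTF-16: one inside the BMP, else a surrogate pair.
inline std::size_t utf16_units(char32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 1 : 2;
}

// UTF-32 always uses exactly one code unit.
inline std::size_t utf32_units(char32_t cp) {
    assert(cp <= 0x10FFFF);
    return 1;
}

// Storage in bytes: code unit count times code unit width (1, 2, 4 bytes).
inline std::size_t utf8_bytes(char32_t cp)  { return utf8_units(cp) * 1; }
inline std::size_t utf16_bytes(char32_t cp) { return utf16_units(cp) * 2; }
inline std::size_t utf32_bytes(char32_t cp) { return utf32_units(cp) * 4; }
```

For example, U+20AC (the euro sign) takes three code units in UTF-8 but
one in UTF-16, while U+1F600 takes four in UTF-8 and two in UTF-16;
UTF-32 spends four bytes on every code point, including plain ASCII.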