From: Erik Wien (wien_at_[hidden])
Date: 2005-04-06 12:33:24


Rogier van Dalen wrote:
> Great; we'll go on with the discussion.
>
> I'm glad you agree with me on most points. :-)

Great, isn't it? :D

>>Having strings locked to a normalization form would be the most logical
>>way to go. What I don't really see though, is why you would have to have
>>a separate class (different from the code point string class that is)
>>for this functionality. If we made the code point string classes (both
>>the static and dynamic ones) have a normalization policy and provide a
>>policy that doesn't actually do anything, in addition to ones that
>>normalize to each of the normalization forms, everyone could have their
>>way. If you don't care about normalization, use the do_nothing one. If
>>you do care (or simply have no clue what normalization is - most users),
>>use NFD or NFC or something.
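
For anyone skimming the thread, the proposal quoted above could look
roughly like the sketch below. All names here (code_point_string,
do_nothing, toy_nfc) are made up for illustration and are not part of
any existing library. The "NFC" policy is a toy that knows a single
composition, U+0065 followed by U+0301 becoming U+00E9; a real policy
would need the full Unicode data tables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using code_point = std::uint32_t;

// Policy that leaves the code point sequence untouched.
struct do_nothing {
    static void normalize(std::vector<code_point>&) {}
};

// Toy stand-in for an NFC policy (assumption: handles one pair only):
// U+0065 (e) followed by U+0301 (combining acute) becomes U+00E9.
struct toy_nfc {
    static void normalize(std::vector<code_point>& s) {
        std::vector<code_point> out;
        for (code_point c : s) {
            if (c == 0x0301 && !out.empty() && out.back() == 0x0065)
                out.back() = 0x00E9;   // compose with the preceding 'e'
            else
                out.push_back(c);
        }
        s.swap(out);
    }
};

// Hypothetical code point string parameterized on a normalization policy.
template <class NormalizationPolicy>
class code_point_string {
public:
    void push_back(code_point c) {
        data_.push_back(c);
        NormalizationPolicy::normalize(data_);  // re-establish the invariant
    }
    std::size_t size() const { return data_.size(); }
    code_point back() const { return data_.back(); }

private:
    std::vector<code_point> data_;
};

// Usage: the same two appends give different contents per policy.
inline void policy_demo() {
    code_point_string<do_nothing> raw;
    raw.push_back(0x0065);
    raw.push_back(0x0301);
    assert(raw.size() == 2);

    code_point_string<toy_nfc> composed;
    composed.push_back(0x0065);
    composed.push_back(0x0301);
    assert(composed.size() == 1);
}
```

The policy is only a compile-time switch here; renormalizing the whole
buffer on every append is deliberately naive.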

Based on the comments made by Miro in the other thread (and by you in
response to them), I'm going to disagree with myself on that point.
(How's that for a change?) Normalization in a code-point string would
lead to many problems with searching and inserting that I never even
thought of. A grapheme-cluster string makes more and more sense the more
I think about it. I'll throw some ideas around over the weekend (don't
hold me to this) and see if I can come up with a smart way of
implementing something like that.

> I'm not sure about this. The simplicity point is a good one. Assuming
> you do want to have built-in grapheme cluster support, I do however
> see two problems with this approach:
>
> 1. You'd still need two kinds of iterators: iterators over codepoints,
> and iterators over grapheme clusters. This makes things conceptually
> muddy for users, I think. The string class will need codepoint
> versions and grapheme cluster versions of many methods (e.g., insert,
> erase, find*). You may end up actually implementing two strings in one
> string class.
>
> 2. Elements are not straightforwardly inserted into the sequence.
> E.g., appending 0x317 (a combining character) to a string s will not
> make s.back() return 0x317.
>
> In short, a code point string that automatically normalises is not a
> Sequence, though it may superficially look like one. I have a feeling
> this would be more difficult to understand for users than two separate
> string classes would. But maybe that's because I already understand my
> own viewpoint?

All true. You have me convinced, sir.
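
To make point 2 concrete, here is a minimal sketch of a string that
keeps trailing combining marks in canonical order, which is one part of
what NFD requires. The class name reordering_string and the two-entry
combining class table are assumptions for illustration only; a real
implementation would look the classes up in the Unicode Character
Database (where, as far as I know, U+0317 has class 220 and U+0301 has
class 230).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using code_point = std::uint32_t;

// Hypothetical two-entry table of canonical combining classes.
// Every other code point is treated as a starter (class 0).
inline int combining_class(code_point c) {
    switch (c) {
        case 0x0317: return 220;  // COMBINING ACUTE ACCENT BELOW
        case 0x0301: return 230;  // COMBINING ACUTE ACCENT
        default:     return 0;
    }
}

// Toy string that canonically reorders the trailing run of combining
// marks after every append (no decomposition, no composition).
class reordering_string {
public:
    void push_back(code_point c) {
        data_.push_back(c);
        // Find where the trailing run of non-starters begins.
        std::size_t first = data_.size();
        while (first > 0 && combining_class(data_[first - 1]) != 0)
            --first;
        // A stable sort by combining class gives the canonical order.
        std::stable_sort(
            data_.begin() + static_cast<std::ptrdiff_t>(first), data_.end(),
            [](code_point a, code_point b) {
                return combining_class(a) < combining_class(b);
            });
    }
    code_point back() const { return data_.back(); }
    std::size_t size() const { return data_.size(); }

private:
    std::vector<code_point> data_;
};

// Append 'a', U+0301, then U+0317 and report what back() sees.
inline code_point back_after_appending_0317() {
    reordering_string s;
    s.push_back(0x0061);
    s.push_back(0x0301);
    s.push_back(0x0317);
    return s.back();
}
```

With this table, back_after_appending_0317() returns 0x0301 rather than
the 0x0317 that was just appended, because the lower-class mark is
sorted in front of the higher-class one. That is exactly the kind of
behaviour a Sequence is not supposed to have.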

- Erik

