From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2005-04-04 06:43:49


On Apr 4, 2005 12:51 PM, Erik Wien <wien_at_[hidden]> wrote:
> Sorry about the late reply. I have been away for Easter and, to top it
> all off, have been sick for a while. Anyway, I'm back...

Great; we'll go on with the discussion.

I'm glad you agree with me on most points. :-)

> Yep... You are of course right. I should start thinking before I talk. :)

I don't know... thinking out loud often works in abstract discussions
like this.

> Having strings locked to a normalization form would be the most logical
> way to go. What I don't really see, though, is why you would have to have
> a separate class (different from the code point string class that is)
> for this functionality. If we made the code point string classes (both
> the static and dynamic ones) have a normalization policy and provide a
> policy that doesn't actually do anything, in addition to ones that
> normalize to each of the normalization forms, everyone could have their
> way. If you don't care about normalization, use the do_nothing one. If
> you do care (or simply have no clue what normalization is - most users),
> use NFD or NFC or something.
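
For concreteness, the design you describe might be sketched roughly as
follows. This is only an illustration under my own assumptions: the names
code_point_string, do_nothing and toy_nfc are hypothetical, and toy_nfc
knows a single composition (0x65 followed by 0x301 becomes 0xE9) as a
stand-in for a real NFC implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using code_point = std::uint32_t;

// Policy for users who do not care about normalization: leaves the
// stored code points exactly as they were inserted.
struct do_nothing {
    static void normalize(std::vector<code_point>&) {}
};

// Hypothetical stand-in for an NFC policy. It knows exactly one
// composition (e + COMBINING ACUTE ACCENT -> e-acute); a real policy
// would be driven by the Unicode Character Database.
struct toy_nfc {
    static void normalize(std::vector<code_point>& s) {
        std::vector<code_point> out;
        for (code_point c : s) {
            if (c == 0x0301 && !out.empty() && out.back() == 0x0065)
                out.back() = 0x00E9;  // compose in place
            else
                out.push_back(c);
        }
        s.swap(out);
    }
};

// Code point string parameterised on a normalization policy.
// Re-normalizing the whole buffer on every push_back is wasteful but
// keeps the sketch short.
template <class NormalizationPolicy>
class code_point_string {
public:
    void push_back(code_point c) {
        data_.push_back(c);
        NormalizationPolicy::normalize(data_);
    }
    code_point back() const { return data_.back(); }
    std::size_t size() const { return data_.size(); }

private:
    std::vector<code_point> data_;
};
```

With do_nothing, pushing 0x65 and then 0x301 leaves two code points in
the string; with toy_nfc the same two calls leave the single code point
0xE9.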

I'm not sure about this. The simplicity point is a good one. Assuming
you do want to have built-in grapheme cluster support, I do however
see two problems with this approach:

1. You'd still need two kinds of iterators: iterators over codepoints,
and iterators over grapheme clusters. This makes things conceptually
muddy for users, I think. The string class will need codepoint
versions and grapheme cluster versions of many methods (e.g., insert,
erase, find*). You may end up actually implementing two strings in one
string class.

2. Elements are not straightforwardly inserted into the sequence.
E.g., appending 0x317 (a combining character) to a string s will not
necessarily make s.back() return 0x317, because normalization may
reorder it relative to other combining marks.
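
One way the surprise in point 2 can arise is canonical reordering. The
sketch below is a toy, not a real normalizer: append_reordered and the
two-entry combining_class table are hypothetical helpers of mine, and
they assume the string is already in canonical order before the append.
The combining classes used are 230 for 0x301 (combining acute accent,
above) and 220 for 0x317 (combining acute accent below).

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using code_point = std::uint32_t;

// Toy canonical combining class lookup covering only two marks.
// Everything else is treated as a starter (class 0).
inline int combining_class(code_point c) {
    switch (c) {
        case 0x0301: return 230;  // COMBINING ACUTE ACCENT
        case 0x0317: return 220;  // COMBINING ACUTE ACCENT BELOW
        default:     return 0;
    }
}

// Appends c, then moves it left past any preceding marks whose
// combining class is strictly higher, which is the canonical ordering
// rule applied to the newly added code point.
inline void append_reordered(std::vector<code_point>& s, code_point c) {
    s.push_back(c);
    const int ccc = combining_class(c);
    if (ccc == 0) return;  // starters never move
    std::size_t i = s.size() - 1;
    while (i > 0 && combining_class(s[i - 1]) > ccc) {
        std::swap(s[i - 1], s[i]);
        --i;
    }
}
```

If s holds 0x61 0x301 and 0x317 is appended, the stored sequence becomes
0x61 0x317 0x301, so the last element is 0x301 rather than the code point
that was just appended.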

In short, a code point string that automatically normalises is not a
Sequence, though it may superficially look like one. I have a feeling
this would be more difficult to understand for users than two separate
string classes would. But maybe that's because I already understand my
own viewpoint?

Regards,
Rogier


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk