From: Erik Wien (wien_at_[hidden])
Date: 2004-10-19 19:34:40


Peter Dimov wrote:
> It appears that there are two schools of thought when it comes to string
> design. One approach treats a string purely as a sequential container of
> values. The other tries to represent "string values" as a coherent whole.
> It doesn't help that in the simple case where the value_type is char the
> two approaches result in mostly identical semantics.
>
> My opinion is that the std::char_traits<> experiment failed and
> conclusively demonstrated that the "string as a value" approach is a dead
> end, and that practical string libraries must treat a string as a
> sequential container, vector<char>, vector<char16_t> and vector<char32_t>
> in our case.
>
> The interpretation of that sequence of integers as a concrete string value
> representation needs to be done by algorithms.

That is roughly what my current implementation does, but the container is
not directly accessible to the user (nor do I think it should be). Instead
of relying on external algorithms, I wrap the vector of code points in a
class and provide different types of iterators to iterate through the
vector at different "character levels". You can therefore access the string
on a code unit level, but the casual user would not necessarily know (or
care) about that. Instead, that user would take the "string as a value"
approach, using strings to represent a sentence, a word, or some other
language construct. When most people think of a string, they think of text,
not the underlying binary representation, and that is therefore, in my
opinion, the notion a library should be designed around.
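
To make the idea concrete, here is a rough sketch of such a wrapper. It is
not the actual implementation: the class name utf16_text, the member names,
and the choice of UTF-16 storage are all assumptions made for illustration.
It exposes the same buffer at two levels, raw code units and decoded code
points.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical wrapper (all names are illustrative). It assumes the
// text is stored as UTF-16 code units and keeps the vector private.
class utf16_text {
public:
    explicit utf16_text(std::vector<char16_t> units)
        : units_(std::move(units)) {}

    // Code unit level: plain iteration over the stored 16-bit values.
    std::vector<char16_t>::const_iterator units_begin() const { return units_.begin(); }
    std::vector<char16_t>::const_iterator units_end() const { return units_.end(); }

    // Code point level: a minimal forward iterator that decodes
    // surrogate pairs while walking the same buffer.
    class code_point_iterator {
    public:
        code_point_iterator(const std::vector<char16_t>* units, std::size_t pos)
            : units_(units), pos_(pos) {}

        char32_t operator*() const {
            const char16_t lead = (*units_)[pos_];
            if (is_pair()) {
                const char16_t trail = (*units_)[pos_ + 1];
                return static_cast<char32_t>(
                    0x10000u + ((static_cast<char32_t>(lead) - 0xD800u) << 10)
                             + (static_cast<char32_t>(trail) - 0xDC00u));
            }
            // Unpaired surrogates are passed through unchanged in this sketch.
            return lead;
        }

        code_point_iterator& operator++() {
            pos_ += is_pair() ? 2 : 1;
            return *this;
        }

        bool operator==(const code_point_iterator& other) const { return pos_ == other.pos_; }
        bool operator!=(const code_point_iterator& other) const { return pos_ != other.pos_; }

    private:
        // True when the current position holds a lead surrogate that is
        // followed by a trail surrogate.
        bool is_pair() const {
            return pos_ + 1 < units_->size()
                && (*units_)[pos_] >= 0xD800 && (*units_)[pos_] <= 0xDBFF
                && (*units_)[pos_ + 1] >= 0xDC00 && (*units_)[pos_ + 1] <= 0xDFFF;
        }

        const std::vector<char16_t>* units_;
        std::size_t pos_;
    };

    code_point_iterator points_begin() const { return code_point_iterator(&units_, 0); }
    code_point_iterator points_end() const { return code_point_iterator(&units_, units_.size()); }

private:
    std::vector<char16_t> units_;
};
```

A casual user would only ever touch points_begin()/points_end() (or a
higher level still), while units_begin()/units_end() stay available for
code that really needs the encoded form.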

> In other words, I believe that string::operator== should always perform
> the per-element comparison std::equal( lhs.begin(), lhs.end(),
> rhs.begin() ) that is specified in the Container requirements table.
>
> If I want to test whether two sequences of char16_t's, interpreted as
> UTF16 Unicode strings, would represent the same string in a printed form,
> I should be given a dedicated function that does just that - or an
> equivalent. Similarly, if I want to normalize a sequence of chars that are
> actually UTF8, I'd call the appropriate 'normalize' function/algorithm.

Though I see where you are coming from, I don't agree with you on that. In
my opinion a good Unicode library should hide as much as possible of the
complexity of the actual character representation from the user. If we
require the user to know that a direct binary comparison of strings is not
the same as an actual textual comparison, we lose some of the simplicity of
the library. Most users of such a library would not know that the character
ö can be represented both as 'o¨' (an o followed by a combining diaeresis)
and as the single precomposed 'ö', and that as a consequence, calling == on
two strings could result in the behaviour "ö" != "ö". By removing the need
for such knowledge, we reduce the learning curve considerably, which is one
of the main reasons for abstracting this functionality anyway.
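
The ö case can be shown in a few lines. The sketch below is only an
illustration: same_code_points, toy_compose and same_text are made-up names,
and toy_compose is a stand-in that knows a single composition rule (U+006F
followed by U+0308 becomes U+00F6). A real library would apply full Unicode
normalization driven by the Unicode character data instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using code_points = std::vector<char32_t>;

// Per-element comparison, in the spirit of the Container requirements.
bool same_code_points(const code_points& a, const code_points& b) {
    return a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin());
}

// Toy stand-in for canonical composition. ASSUMPTION: it handles only
// "o" (U+006F) + combining diaeresis (U+0308) -> "ö" (U+00F6).
code_points toy_compose(const code_points& in) {
    code_points out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == 0x006F && i + 1 < in.size() && in[i + 1] == 0x0308) {
            out.push_back(0x00F6);
            ++i;  // skip the combining mark that was just consumed
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}

// Textual comparison: bring both sides to one form, then compare.
bool same_text(const code_points& a, const code_points& b) {
    return same_code_points(toy_compose(a), toy_compose(b));
}
```

With precomposed = {0x00F6} and decomposed = {0x006F, 0x0308},
same_code_points reports a difference while same_text reports equality. The
disagreement between us is which of the two operator== should map to.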

