From: Peter Dimov (pdimov_at_[hidden])
Date: 2004-10-19 18:45:38


Erik Wien wrote:
> Ultimately I feel that the operation of normalization (which involves
> canonical decomposition) of unicode strings should be hidden from the
> user completely and be performed automatically by the library where
> that is needed. (Like on a call to the == operator.)

It appears that there are two schools of thought when it comes to string
design. One approach treats a string purely as a sequential container of
values. The other tries to represent "string values" as a coherent whole. It
doesn't help that in the simple case where the value_type is char the two
approaches result in mostly identical semantics.

My opinion is that the std::char_traits<> experiment failed and conclusively
demonstrated that the "string as a value" approach is a dead end, and that
practical string libraries must treat a string as a sequential container,
vector<char>, vector<char16_t> and vector<char32_t> in our case.

The interpretation of that sequence of integers as a concrete string value
representation needs to be done by algorithms.

In other words, I believe that string::operator== should always perform the
per-element comparison lhs.size() == rhs.size() &&
std::equal( lhs.begin(), lhs.end(), rhs.begin() ) that is specified in the
Container requirements table.
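
A minimal sketch of what that per-element comparison means in practice,
assuming a C++11 compiler with char16_t. The names u16_sequence,
code_units_equal, precomposed_e_acute and decomposed_e_acute are made up for
illustration and are not part of any proposed library:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical alias: a "string" that is nothing more than a sequence of
// UTF-16 code units.
using u16_sequence = std::vector<char16_t>;

// Element-wise equality, spelled out the way the Container requirements
// define a == b. No Unicode knowledge is applied here.
inline bool code_units_equal(const u16_sequence& lhs, const u16_sequence& rhs)
{
    return lhs.size() == rhs.size()
        && std::equal(lhs.begin(), lhs.end(), rhs.begin());
}

// U+00E9 (e with acute) as a single precomposed code point ...
inline u16_sequence precomposed_e_acute() { return { 0x00E9 }; }

// ... and the same character written as 'e' followed by U+0301
// (combining acute accent).
inline u16_sequence decomposed_e_acute() { return { 0x0065, 0x0301 }; }
```

Under this definition the two spellings compare unequal, even though they are
canonically equivalent and would normally print the same. That is the intended
behaviour of the container view: deciding that they are "the same string" is
left to an algorithm.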

If I want to test whether two sequences of char16_t's, interpreted as UTF-16
Unicode strings, would represent the same string in printed form, I should
be given a dedicated function that does just that - or an equivalent.
Similarly, if I want to normalize a sequence of chars that are actually
UTF-8, I'd call the appropriate 'normalize' function/algorithm.
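
A rough sketch of what such a dedicated function could look like, kept apart
from operator==. Everything here is hypothetical: toy_decompose and
toy_canonically_equal are invented names, and the three-entry table is only a
stand-in for the real Unicode decomposition data. It skips canonical
reordering of combining marks and ignores surrogate pairs, so it is not a
conforming normalizer:

```cpp
#include <vector>

// Hypothetical alias: a plain sequence of UTF-16 code units.
using u16_sequence = std::vector<char16_t>;

// Toy stand-in for the Unicode decomposition data: each entry gives the
// full canonical decomposition of one code point into base + combining mark.
// A real implementation would use the complete Unicode Character Database.
struct toy_mapping { char16_t composed; char16_t base; char16_t mark; };

static const toy_mapping toy_table[] = {
    { 0x00E9, 0x0065, 0x0301 },  // e with acute  -> e + combining acute
    { 0x00C5, 0x0041, 0x030A },  // A with ring   -> A + combining ring
    { 0x212B, 0x0041, 0x030A },  // Angstrom sign -> A + combining ring
};

// Replace every code unit found in the toy table by its decomposition and
// copy everything else through unchanged.
inline u16_sequence toy_decompose(const u16_sequence& in)
{
    u16_sequence out;
    for (char16_t unit : in) {
        bool mapped = false;
        for (const toy_mapping& m : toy_table) {
            if (unit == m.composed) {
                out.push_back(m.base);
                out.push_back(m.mark);
                mapped = true;
                break;
            }
        }
        if (!mapped) out.push_back(unit);
    }
    return out;
}

// The "dedicated function": compare the decomposed forms, leaving the
// container's own operator== as a plain element-wise comparison.
inline bool toy_canonically_equal(const u16_sequence& lhs,
                                  const u16_sequence& rhs)
{
    return toy_decompose(lhs) == toy_decompose(rhs);
}
```

With this split, { 0x00E9 } and { 0x0065, 0x0301 } are unequal under
operator== but equal under toy_canonically_equal, and the caller chooses which
question is being asked.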

But I may be wrong. :-)

