From: Eric Niebler (eric_at_[hidden])
Date: 2004-10-21 14:08:18


Erik Wien wrote:
> "Rogier van Dalen" <rogiervd_at_[hidden]> wrote in message
>
>>I hadn't yet looked at it this way, but you are right from a
>>theoretical point of view at least. To get more to practical matters,
>>what do you think this should do:
>>
>>unicode::string s = ...;
>>s += 0xDC01; // An isolated surrogate, which is nonsense
>>
>>?
>>Should it throw, or convert the isolated surrogate to U+FFFD
>>REPLACEMENT CHARACTER (Unicode standard 4 Section 2.7), or something
>>else? And what should the member function with the opposite behaviour
>>be called?
>
>
> The best solution would be to never append single code units, but instead
> code points. The += operator would determine how many code units are required
> for the given code point.
>

I disagree. The user should be allowed to twiddle as many bits as she
pleases, and should even be permitted to create an invalid UTF string.
However, operations that interpret the string as a whole (comparison,
canonicalization, etc.) should detect invalid strings and throw. The
reason is that people will need to manipulate strings at the bit level,
and intermediate states may be invalid even though the final state is
valid. We shouldn't do too much nannying during these intermediate states.
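
A minimal sketch of that split, assuming UTF-16 storage. The class name
utf16_string and its members is_valid() and equals() are hypothetical,
made up for illustration; they are not the interface of any existing or
proposed library. Appending a code unit never checks anything, while the
whole-string operation validates first and throws:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical UTF-16 string, for illustration only.
class utf16_string {
public:
    // Unchecked: appends one raw code unit, even an isolated surrogate.
    utf16_string& operator+=(std::uint16_t unit) {
        units_.push_back(unit);
        return *this;
    }

    // Returns false if the string contains an unpaired surrogate.
    bool is_valid() const {
        for (std::size_t i = 0; i < units_.size(); ++i) {
            const std::uint16_t u = units_[i];
            if (is_high(u)) {
                // A high surrogate must be followed by a low surrogate.
                if (i + 1 == units_.size() || !is_low(units_[i + 1]))
                    return false;
                ++i; // skip the low half of the pair
            } else if (is_low(u)) {
                return false; // low surrogate with no preceding high one
            }
        }
        return true;
    }

    // A whole-string operation: validates both operands, then compares
    // code units (a real comparison would also consider normalization).
    bool equals(const utf16_string& other) const {
        if (!is_valid() || !other.is_valid())
            throw std::invalid_argument("ill-formed UTF-16 string");
        return units_ == other.units_;
    }

    std::size_t size() const { return units_.size(); }

private:
    static bool is_high(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
    static bool is_low(std::uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

    std::vector<std::uint16_t> units_;
};
```

With this sketch, s += 0xD800 leaves s invalid and s.equals(s) would
throw, but after a further s += 0xDC01 the string holds a complete
surrogate pair and the same call succeeds. The invalid intermediate
state costs nothing unless someone interprets the string while it lasts.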

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com
