From: Miro Jurisic (macdev_at_[hidden])
Date: 2004-10-21 15:47:17


In article <41780922.9070406_at_[hidden]>,
 "Eric Niebler" <eric_at_[hidden]> wrote:

> Erik Wien wrote:
> > "Rogier van Dalen" <rogiervd_at_[hidden]> wrote in message
> >
> >>I hadn't yet looked at it this way, but you are right from a
> >>theoretical point of view at least. To get more to practical matters,
> >>what do you think this should do:
> >>
> >>unicode::string s = ...;
> >>s += 0xDC01; // An isolated surrogate, which is nonsense
> >>
> >>?
> >>Should it throw, or convert the isolated surrogate to U+FFFD REPLACEMENT
> >>CHARACTER (Unicode standard 4 Section 2.7), or something else? And what
> >>should the member function with the opposite behaviour be called?
> >
> >
> > The best solution would be to never append single code units, but instead
> > code points. The += operator would determine how many code units is
> > required for the given code point.
>
> I disagree. The user should be allowed to twiddle as many bits as she
> pleases, even permitted to create an invalid UTF string. However,
> operations that interpret the string as a whole (comparison,
> canonicalization, etc.) should detect invalid strings and throw. The
> reason is that people will need to manipulate strings at the bit level,
> and intermediate states may be invalid, but that the final state may be
> valid. We shouldn't do too much nannying during these intermediate states.

I am not sure I buy this. I think that if you want to have unchecked Unicode
data, you should use a vector<char*_t>. Unicode strings have well-defined
invariants with respect to canonicalization and well-formedness, and I think
that a Unicode string abstraction should enforce those invariants.

Having intermediate states that are invalid and a final state that is valid is
not a feature, it's a bug. It's a silent failure that I want to know about.
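
To make that concrete, here is a rough sketch of an append that enforces
well-formedness. The class name checked_utf16_string and its members are made
up for illustration and are not an existing or proposed Boost interface. It
also assumes a compiler that provides char16_t and char32_t, and it only
covers well-formedness, not canonicalization.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical illustration only: not an actual or proposed Boost interface.
class checked_utf16_string {
public:
    // Append one code point, encoding it as one or two UTF-16 code units.
    // Throws before modifying the string if the value cannot be encoded,
    // so the stored sequence is always well-formed UTF-16.
    checked_utf16_string& operator+=(char32_t cp) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("isolated surrogate");
        if (cp > 0x10FFFF)
            throw std::invalid_argument("value beyond U+10FFFF");

        if (cp < 0x10000) {
            // Basic Multilingual Plane: a single code unit.
            units_.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary plane: a high/low surrogate pair.
            cp -= 0x10000;
            units_.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            units_.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        return *this;
    }

    // Read-only access to the underlying code units.
    const std::u16string& code_units() const { return units_; }

private:
    std::u16string units_;
};
```

With something like this, Rogier's s += 0xDC01 throws at the point of the
mistake instead of leaving a broken string to be discovered later, and anyone
who really wants to twiddle code units can do so in a plain vector and hand
over the result once it is valid.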

meeroh

