From: Miro Jurisic (macdev_at_[hidden])
Date: 2004-10-21 15:47:17

Next message: Rogier van Dalen: "Re: [boost] Re: Any interest in adding unicode support to boost?"
Previous message: Jonathan Graehl: "[boost] get original argv string(s) for program option?"
In reply to: Eric Niebler: "[boost] Re: Any interest in adding unicode support to boost?"
Next in thread: Erik Wien: "[boost] Re: Any interest in adding unicode support to boost?"
Reply: Erik Wien: "[boost] Re: Any interest in adding unicode support to boost?"

In article <41780922.9070406_at_[hidden]>,
"Eric Niebler" <eric_at_[hidden]> wrote:

> Erik Wien wrote:
> > "Rogier van Dalen" <rogiervd_at_[hidden]> wrote in message
> >
> >>I hadn't yet looked at it this way, but you are right from a
> >>theoretical point of view at least. To get more to practical matters,
> >>what do you think this should do:
> >>
> >>unicode::string s = ...;
> >>s += 0xDC01; // An isolated surrogate, which is nonsense
> >>
> >>?
> >>Should it throw, or convert the isolated surrogate to U+FFFD REPLACEMENT
> >>CHARACTER (Unicode standard 4 Section 2.7), or something else? And what
> >>should the member function with the opposite behaviour be called?
> >
> >
> > The best solution would be to never append single code units, but instead
> > code points. The += operator would determine how many code units are
> > required for the given code point.
>
> I disagree. The user should be allowed to twiddle as many bits as she
> pleases, even permitted to create an invalid UTF string. However,
> operations that interpret the string as a whole (comparison,
> canonicalization, etc.) should detect invalid strings and throw. The
> reason is that people will need to manipulate strings at the bit level,
> and intermediate states may be invalid, but that the final state may be
> valid. We shouldn't do too much nannying during these intermediate states.

I am not sure I buy this. I think that if you want to have unchecked Unicode
data, you should use a vector<char*_t>. Unicode strings have well-defined
invariants with respect to canonicalization and well-formedness, and I think
that a Unicode string abstraction should enforce those invariants.
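
To make the well-formedness half of that concrete, here is a minimal sketch of
an append that refuses to store an isolated surrogate. The class name
checked_utf16_string and its members are hypothetical, invented for this
illustration, and are not part of any proposed Boost interface. It assumes
UTF-16 storage and that throwing is the chosen failure mode; canonicalization
is not addressed.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical checked string (illustration only): stores UTF-16 code units,
// but accepts only whole code points, so the stored data stays well-formed.
class checked_utf16_string {
public:
    // Append one code point, encoding it as one or two UTF-16 code units.
    // Throws rather than storing a value that would make the string ill-formed.
    checked_utf16_string& operator+=(char32_t cp) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate is not a Unicode scalar value");
        if (cp > 0x10FFFF)
            throw std::invalid_argument("value is beyond U+10FFFF");
        if (cp < 0x10000) {
            // Basic Multilingual Plane: a single code unit.
            units_.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary plane: split into a high/low surrogate pair.
            const char32_t v = cp - 0x10000;
            units_.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            units_.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
        return *this;
    }

    std::size_t code_unit_count() const { return units_.size(); }
    const std::vector<char16_t>& code_units() const { return units_; }

private:
    std::vector<char16_t> units_;
};
```

With this sketch, s += 0xDC01 throws std::invalid_argument and leaves s
unchanged, while appending U+1F600 stores the pair 0xD83D 0xDE00. Code that
really needs unchecked code units can still work on a plain vector.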

Having intermediate states that are invalid and a final state that is valid is
not a feature, it's a bug. It's a silent failure that I want to know about.

meeroh
