Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Peter Dimov (lists_at_[hidden])
Date: 2015-10-09 17:38:56

Artyom Beilis wrote:
> > What I meant by that is for instance
> >
> > - is 0xCC 0x81 a valid UTF-8 string?
> > - is 0x65 0xCC 0x81 0xCC 0x81 a valid UTF-8 string?
>
> Both are valid strings.. and both are meaningless on their own i.e. accent
> without letter or two same accents.
>
> Being illogical in human terms or representation does not make them UTF-8
> illegal.
>
> UTF-8 is simple, human language processing is complex.

My point here is that strictly valid UTF-8 is the valid multibyte encoding
of a valid codepoint sequence, and that the definition of "valid codepoint
sequence" may vary depending on context, such that the above sequences are
considered invalid.

Drawing a line at the place where codepoints above U+10FFFF and lone
surrogates are invalid but the above sequences are valid is an arbitrary
decision. Not that this decision is wrong, it isn't. But it may not be what
the user needs.

Saying "invalid UTF-8 is just invalid, period" doesn't always work very
well, although it's a good default. There are cases in which you have to
handle specific kinds of invalid UTF-8 (but not any invalid UTF-8) and
having to write UTF-8 encoding/decoding functions for every such instance
does not really contribute to either security or correctness. It's better -
I posit - to have functions that can be configured to handle various invalid
forms of UTF-8 (that is, to accept certain invalid UTF-8, not necessarily to
produce it, of course).
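
To make the idea concrete, here is a minimal sketch of what such a
configurable decoder could look like. Everything in it is hypothetical: the
names decode_utf8, utf8_flags and is_combining_mark are invented for
illustration and are not part of Boost.Nowide, Boost.Locale or any existing
library. The combining-mark test is deliberately simplified to the single
block U+0300..U+036F. The default is strict, and each flag moves the line in
one specific direction.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical option flags (illustration only, not an existing API).
enum utf8_flags : unsigned {
    utf8_strict              = 0,       // the usual definition of valid UTF-8
    utf8_allow_surrogates    = 1u << 0, // accept encoded U+D800..U+DFFF (CESU-8 / WTF-8 style input)
    utf8_allow_overlong_nul  = 1u << 1, // accept 0xC0 0x80 as U+0000 ("Modified UTF-8" style input)
    utf8_reject_leading_mark = 1u << 2  // stricter: text must not start with a combining mark
};

// Simplified assumption: only the Combining Diacritical Marks block is checked.
inline bool is_combining_mark(char32_t cp) { return cp >= 0x0300 && cp <= 0x036F; }

// Decodes `in` to codepoints, or returns nullopt if it is invalid under `flags`.
inline std::optional<std::u32string> decode_utf8(std::string_view in, unsigned flags = utf8_strict)
{
    static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else return std::nullopt;                      // stray continuation or bad lead byte
        if (in.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_for_len[len]) {                   // overlong encoding
            const bool nul_ok = (flags & utf8_allow_overlong_nul) && len == 2 && cp == 0;
            if (!nul_ok) return std::nullopt;
        }
        if (cp > 0x10FFFF) return std::nullopt;        // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF && !(flags & utf8_allow_surrogates))
            return std::nullopt;
        if ((flags & utf8_reject_leading_mark) && out.empty() && is_combining_mark(cp))
            return std::nullopt;
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Usage: the two sequences from this thread, plus two "specific kinds" of invalid input.
inline void self_check()
{
    assert(decode_utf8("\xCC\x81"));                             // lone accent: valid by default
    assert(decode_utf8("e\xCC\x81\xCC\x81"));                    // doubled accent: valid by default
    assert(!decode_utf8("\xCC\x81", utf8_reject_leading_mark));  // invalid in a stricter context
    assert(!decode_utf8("\xED\xA0\x80"));                        // lone surrogate: rejected by default
    assert(decode_utf8("\xED\xA0\x80", utf8_allow_surrogates));  // accepted only on request
}
```

The point of the flags is that the caller opts in to one named deviation at
a time, so accepting lone surrogates does not also mean accepting overlong
forms or out-of-range values. The decoder only accepts such input; it never
produces it.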