Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Andrey Semashev (andrey.semashev_at_[hidden])
Date: 2015-10-09 11:03:21


On 09.10.2015 17:41, Peter Dimov wrote:
> Beman Dawes wrote:
>
>> IMO, a critical aspect of all of those, including utf-8 to utf-8, is
>> that they detect all utf-8 errors since ill-formed utf-8 is used as an
>> attack vector.
>
> That is what I alluded to earlier with my bikeshedding comment - I
> personally find this policy a bit too firm for my taste. Sure, sometimes
> I do want to reject any invalid UTF-8 with extreme prejudice, but at
> other times I do not. For instance, when I get a Windows file name, it
> can well be invalid UTF-16, which when converted will become invalid
> UTF-8 but which will roundtrip correctly back to its original invalid
> UTF-16 form and refer to the same file. That's why things like CESU-8 or
> WTF-8 exist.
>
> So I like the "method" argument of locale::conv::utf_to_utf, except that
> I think that it doesn't offer enough control.

I think UTF-8 is UTF-8 (i.e. the character encoding that is described
by the standard), and a tool for working with it should adhere to the
specification. This includes signalling invalid code sequences
instead of producing them.

WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with
them should be the user's explicit choice (e.g. the user should write
utf16_to_wtf8 instead of utf16_to_utf8).
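
A minimal sketch of that split, assuming two free functions named as in the
example above. Neither function nor the helpers below are part of any existing
Boost API; they are hypothetical and only illustrate the idea that the strict
converter reports unpaired surrogates, while the WTF-8 one encodes them
because the caller asked for that explicitly.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: writes a value using the generic UTF-8 bit layout.
// It does not check whether the value is a valid Unicode scalar value.
static void append_utf8_layout(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

static bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical shared implementation; the flag is the only difference
// between the two public functions.
static std::string convert_utf16(const std::u16string& in, bool allow_lone_surrogates) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = in[i];
        if (is_high_surrogate(u) && i + 1 < in.size() && is_low_surrogate(in[i + 1])) {
            // Well-formed pair: combine into one supplementary code point.
            const char32_t cp = 0x10000
                + ((static_cast<char32_t>(u) - 0xD800) << 10)
                + (static_cast<char32_t>(in[i + 1]) - 0xDC00);
            append_utf8_layout(out, cp);
            ++i;
        } else if (is_high_surrogate(u) || is_low_surrogate(u)) {
            // Unpaired surrogate: the input is ill-formed UTF-16.
            if (!allow_lone_surrogates)
                throw std::invalid_argument("unpaired surrogate in UTF-16 input");
            append_utf8_layout(out, u);  // three-byte form, not valid UTF-8
        } else {
            append_utf8_layout(out, u);
        }
    }
    return out;
}

// Strict: throws instead of producing ill-formed UTF-8.
std::string utf16_to_utf8(const std::u16string& in) { return convert_utf16(in, false); }

// Explicit opt-in: unpaired surrogates are kept so the result can be
// converted back to the original (ill-formed) UTF-16.
std::string utf16_to_wtf8(const std::u16string& in) { return convert_utf16(in, true); }
```

With this split, a Windows file name containing an unpaired surrogate makes
utf16_to_utf8 throw, and the caller who needs the round trip has to spell out
utf16_to_wtf8 at the call site.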

