Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Andrey Semashev (andrey.semashev_at_[hidden])
Date: 2015-10-09 12:05:38


On 09.10.2015 18:41, Peter Dimov wrote:
> Andrey Semashev wrote:
>
>> WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with
>> them should be the user's explicit choice (e.g. the user should write
>> utf16_to_wtf8 instead of utf16_to_utf8).
>
> In addition to what I wrote earlier, the choices here are not
> representable in a single U or W letter. When taking UTF-8, you need to
> decide whether to
>
> - accept codepoints over 10FFFF
> - accept codepoints encoded with more bytes than necessary
> - accept surrogates
> - probably more because Unicode is hard
>
> and then for each rejected byte sequence whether to
>
> - throw
> - ignore and skip
> - replace with U+FFFD

As long as the code sequences are described by the spec, I consider them
valid. We can provide a number of options to influence the conversion
process, but the result should be something that can be decoded by a
conforming Unicode parser.
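
To make the choices listed above concrete, here is a minimal sketch of a strict decoder that rejects overlong forms, surrogates, and values above U+10FFFF, and takes the error handling as an explicit option. The names on_error and utf8_to_utf32 are hypothetical and chosen for illustration only; this is not the Boost.Nowide or Boost.Locale API, and the one-byte resynchronization after an error is a simplifying assumption.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical policy enum (illustration only, not a Boost API).
enum class on_error { throw_exception, skip, replace };

// Decode strict UTF-8 into UTF-32. Overlong encodings, surrogates and
// values above U+10FFFF are treated as invalid and handled per `policy`.
inline std::u32string utf8_to_utf32(const std::string& in, on_error policy)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;   // expected sequence length, 0 = invalid lead byte
        char32_t cp = 0;       // code point being assembled
        char32_t min = 0;      // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;   // not a continuation byte
            else cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, out-of-range values and surrogates.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (ok) {
            out.push_back(cp);
            i += len;
            continue;
        }
        switch (policy) {
        case on_error::throw_exception:
            throw std::runtime_error("invalid UTF-8 sequence");
        case on_error::replace:
            out.push_back(static_cast<char32_t>(0xFFFD));
            break;
        case on_error::skip:
            break;
        }
        ++i;   // assumption: resynchronize one byte at a time
    }
    return out;
}
```

Whatever policy is picked, the output here contains only Unicode scalar values, so re-encoding it yields something a conforming parser will accept. Accepting surrogates instead would amount to a separately named function (the utf16_to_wtf8 idea from earlier in the thread) rather than another flag on this one.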

