Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Andrey Semashev (andrey.semashev_at_[hidden])
Date: 2015-10-09 12:05:38

On 09.10.2015 18:41, Peter Dimov wrote:
> Andrey Semashev wrote:
>> WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with
>> them should be the user's explicit choice (e.g. the user should write
>> utf16_to_wtf8 instead of utf16_to_utf8).
> In addition to what I wrote earlier, the choices here are not
> representable in a single U or W letter. When taking UTF-8, you need to
> decide whether to
> - accept codepoints over 10FFFF
> - accept codepoints encoded with more bytes than necessary
> - accept surrogates
> - probably more because Unicode is hard
> and then for each rejected byte sequence whether to
> - throw
> - ignore and skip
> - replace with U+FFFD

As long as the code sequences are described by the spec, I consider them
valid. We can provide a number of options to influence the conversion
process, but the result should be something that can be decoded by a
conforming Unicode parser.
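
To make the options discussed above concrete, here is a rough sketch of a strict decoder that rejects overlong forms, surrogates and code points above 10FFFF, and lets the caller pick what happens on a rejected sequence (throw, skip, or replace with U+FFFD). The names on_error and decode_utf8 are hypothetical, made up for this illustration; they are not part of Boost.Nowide, Boost.Locale or any existing codecvt facet. Resynchronizing one byte at a time is also just an assumption chosen to keep the example short.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical names for illustration only; not an existing Boost API.
enum class on_error { throw_exception, skip, replace };

// Decode UTF-8 into UTF-32, accepting only well-formed sequences.
std::u32string decode_utf8(const std::string& in, on_error policy)
{
    std::u32string out;
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;   // expected sequence length, 0 = invalid lead byte
        char32_t cp = 0;       // code point being assembled
        char32_t min = 0;      // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }

        // The sequence must fit in the input and use only continuation bytes.
        bool ok = len != 0 && i + len <= n;
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, values above 10FFFF, and surrogates.
        if (ok) ok = cp >= min && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);

        if (ok) { out.push_back(cp); i += len; continue; }

        if (policy == on_error::throw_exception)
            throw std::invalid_argument("invalid UTF-8 sequence");
        if (policy == on_error::replace)
            out.push_back(U'\uFFFD');
        ++i;  // assumption: resynchronize by dropping a single byte
    }
    return out;
}
```

With this sketch, each byte of a rejected sequence is reported separately, so a two-byte overlong form yields two replacement characters under on_error::replace. A real implementation might prefer to replace the maximal invalid subsequence with a single U+FFFD.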
