Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2015-10-09 16:11:28


> Andrey Semashev wrote:

>
>> WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them
>> should be the user's explicit choice (e.g. the user should write
>> utf16_to_wtf8 instead of utf16_to_utf8).
>
> In addition to what I wrote earlier, the choices here are not representable
> in a single U or W letter. When taking UTF-8, you need to decide whether to
>
> - accept codepoints over 10FFFF
> - accept codepoints encoded with more bytes than necessary
> - accept surrogates

No... none of this is UTF-8. Period. Accepting codepoints above 10FFFF is like saying "let's assume Pi=3.15...".
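
To make the three cases above concrete, here is a minimal sketch of a strict well-formedness check. The name is_strict_utf8 is hypothetical and is not part of Boost.Nowide, Boost.Locale or <codecvt>; it only illustrates that rejecting out-of-range codepoints, overlong forms and surrogates takes a handful of comparisons.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper for illustration only; not an API of any Boost library.
// Returns true only if `s` is well-formed UTF-8: no codepoints above 10FFFF,
// no overlong encodings, no encoded surrogates, no truncated sequences.
bool is_strict_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;      // total bytes in this sequence
        std::uint32_t cp;     // codepoint being assembled
        std::uint32_t min;    // smallest codepoint allowed for this length

        if (lead < 0x80)                { len = 1; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;    // stray continuation byte or 5/6-byte lead

        if (n - i < len) return false;              // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                     // more bytes than necessary
        if (cp > 0x10FFFF) return false;                // outside the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate

        i += len;
    }
    return true;
}
```

With this check, "\xC0\x80" (overlong NUL), "\xED\xA0\x80" (the surrogate D800) and "\xF4\x90\x80\x80" (110000) are all rejected, while "\xF4\x8F\xBF\xBF" (10FFFF) is accepted. A converter that wants to accept any of the rejected forms is implementing WTF-8, CESU-8 or something else, and should be named accordingly.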

That is why the C++11 <codecvt> has basic design flaws (see the notes in previous e-mails).

> - probably more because Unicode is hard

Unicode isn't hard - it is just treated with ignorance even by big
organizations, not to mention average programmers.

Artyom

