Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2015-10-09 16:07:45

----- Original Message -----

> From: Peter Dimov <lists_at_[hidden]>
> Andrey Semashev wrote:
>> WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them
>> should be the user's explicit choice (e.g. the user should write
>> utf16_to_wtf8 instead of utf16_to_utf8).
> The user doesn't write such things in practice. He writes things like
> string fn = get_file_name();
> fopen( fn.c_str() );
> and get_file_name and fopen must decide how to encode/decode UTF-8. So
> get_file_name gets some wchar_t[] sequence from Windows, which happens to be
> invalid UTF-16. But Windows doesn't care for UTF-16 validity and if you pass

OK... that is an interesting point, relevant to Boost.Nowide but irrelevant
to the utf8_codecvt facets.

The only way UTF-16 can be invalid is to contain improperly paired UTF-16
surrogate code units.
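To make the pairing rule concrete, here is a minimal sketch of a validity check. The function name is_valid_utf16 is hypothetical and is not part of Boost.Nowide; the sketch only assumes the standard surrogate ranges (high 0xD800..0xDBFF, low 0xDC00..0xDFFF).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not a Boost.Nowide API.
// Returns true if every high surrogate (0xD800..0xDBFF) is immediately
// followed by a low surrogate (0xDC00..0xDFFF) and no low surrogate
// appears on its own.
bool is_valid_utf16(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // A high surrogate must have a low surrogate right after it.
            if (i + 1 >= s.size()) return false;
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i; // skip the low surrogate we just checked
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            // A low surrogate with no preceding high surrogate.
            return false;
        }
    }
    return true;
}
```

A file name handed back by Windows that fails such a check is exactly the case discussed here.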

They can technically be encoded as invalid UTF-8 representing code points
in the closed range reserved for surrogates (U+D800..U+DFFF).

i.e. boost::nowide::narrow should generate invalid UTF-8 from invalid UTF-16,
and the reverse conversion should turn UTF-8 that is invalid in this very
special way back into invalid UTF-16.
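As an illustration only, the round trip for a single unpaired surrogate could look like the sketch below. Both function names are hypothetical, and this is not how Boost.Nowide is implemented; it just applies the ordinary 3-byte UTF-8 bit layout to a surrogate value, which a conforming UTF-8 decoder must reject.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes a lone UTF-16 surrogate code unit using the
// 3-byte UTF-8 bit layout. The result is NOT valid UTF-8.
std::string encode_lone_surrogate(char16_t unit)
{
    assert(unit >= 0xD800 && unit <= 0xDFFF);
    std::string out;
    out += static_cast<char>(0xE0 | (unit >> 12));
    out += static_cast<char>(0x80 | ((unit >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (unit & 0x3F));
    return out;
}

// Hypothetical reverse step: recovers the surrogate code unit from such a
// 3-byte sequence (first byte 0xED, second byte in 0xA0..0xBF).
char16_t decode_lone_surrogate(const std::string& bytes)
{
    assert(bytes.size() == 3);
    const unsigned b0 = static_cast<unsigned char>(bytes[0]);
    const unsigned b1 = static_cast<unsigned char>(bytes[1]);
    const unsigned b2 = static_cast<unsigned char>(bytes[2]);
    assert(b0 == 0xED && (b1 & 0xE0) == 0xA0 && (b2 & 0xC0) == 0x80);
    return static_cast<char16_t>(((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F));
}
```

With this layout the unit 0xD800 becomes the bytes ED A0 80, and decoding those bytes gives 0xD800 back, so the invalid file name survives the round trip.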

It looks horrifying to me, but it may actually be a solution to such a problem.

But this should never-ever-ever be used outside Boost.Nowide.

And to be honest - IMHO if a program fails on files that are encoded in
invalid UTF-16 when Windows states that the encoding is UTF-16... then I think
it should fail.

> You should also keep in mind that Unicode strings can have multiple
> representations even if using strict UTF-8. So one could argue that using
> strict UTF-8 provides a false sense of security.

This isn't correct - you are mixing up normalization forms and code point
representation. Yes, properly localized software should generally use normalized

However, a sequence of valid code points has one and only one representation
in UTF-8, and one and only one in UTF-16.
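A small sketch may help separate the two issues. The encoder below (to_utf8 is a hypothetical name, not a Boost API) maps each valid code point to exactly one byte sequence. The multiple representations come from normalization: "é" can be the single code point U+00E9 or the pair U+0065 U+0301, and these are two different code point sequences, each with its own unique encoding.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes a sequence of valid code points as UTF-8.
// Each code point has exactly one (shortest-form) encoding.
std::string to_utf8(const std::u32string& code_points)
{
    std::string out;
    for (char32_t cp : code_points) {
        // Valid code points exclude surrogates and stop at U+10FFFF.
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Here to_utf8(U"\u00E9") yields the bytes C3 A9, while to_utf8(U"e\u0301") yields 65 CC 81. The bytes differ because the code point sequences differ, not because UTF-8 offers a choice.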

There is no such thing as strict UTF-8 - there is either UTF-8 or not.

Interesting note: on Mac OS X, HFS+ requires file names to be stored as
normalized UTF-8 strings (a variant of NFD, i.e. decomposed form).

