Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2015-10-09 16:07:45


----- Original Message -----

> From: Peter Dimov <lists_at_[hidden]>
> Andrey Semashev wrote:
>
>> WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them
>> should be the user's explicit choice (e.g. the user should write
>> utf16_to_wtf8 instead of utf16_to_utf8).
>
> The user doesn't write such things in practice. He writes things like
>
> string fn = get_file_name();
> fopen( fn.c_str() );
>
> and get_file_name and fopen must decide how to encode/decode UTF-8. So
> get_file_name gets some wchar_t[] sequence from Windows, which happens to be
> invalid UTF-16. But Windows doesn't care for UTF-16 validity and if you pass
>

OK... that is an interesting point. It is relevant to Boost.Nowide, but
irrelevant to the utf8_codecvt facets.

The only way UTF-16 can be invalid is to contain improperly paired UTF-16
surrogate code units.
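
To make that concrete, here is a minimal sketch of such a validity check. The
function name is_valid_utf16 is made up for illustration and is not part of
Boost.Nowide or any Boost library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true when every surrogate code unit in `s`
// belongs to a correctly ordered high/low pair.
bool is_valid_utf16(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {
            // High surrogate: a low surrogate must follow immediately.
            if (i + 1 >= s.size()) return false;
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i; // skip the low half of the pair
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            // Low surrogate with no preceding high surrogate.
            return false;
        }
    }
    return true;
}
```

A string holding only 0xD800, or 0xDC00 followed by 0xD800, is rejected, while
the pair 0xD83D 0xDE00 is accepted.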

Technically, they can be encoded as invalid UTF-8 that represents code points
in the closed range reserved for surrogates (U+D800..U+DFFF).

I.e. boost::nowide::narrow would generate invalid UTF-8 from invalid UTF-16,
and the reverse conversion would turn UTF-8 that is invalid in this very
specific way back into invalid UTF-16.
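
As a rough sketch of what such a lenient round trip means at the byte level
(the helper names below are hypothetical and this is not how Boost.Nowide is
implemented): a lone surrogate pushed through the generic 3-byte UTF-8 bit
layout lands in the byte range ED A0 80..ED BF BF, which valid UTF-8 forbids.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one 16-bit value in U+0800..U+FFFF with the
// generic 3-byte UTF-8 bit layout WITHOUT rejecting surrogates.
std::string encode_3byte_lenient(std::uint16_t u)
{
    std::string out;
    out += static_cast<char>(0xE0 | (u >> 12));
    out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (u & 0x3F));
    return out;
}

// Hypothetical inverse: assumes `s` holds exactly one 3-byte sequence.
std::uint16_t decode_3byte_lenient(const std::string& s)
{
    const unsigned b0 = static_cast<unsigned char>(s[0]);
    const unsigned b1 = static_cast<unsigned char>(s[1]);
    const unsigned b2 = static_cast<unsigned char>(s[2]);
    return static_cast<std::uint16_t>(
        ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F));
}

// True when a 3-byte sequence encodes a surrogate (U+D800..U+DFFF),
// i.e. exactly the case a conforming UTF-8 decoder must reject.
bool is_encoded_surrogate(const std::string& s)
{
    if (s.size() != 3) return false;
    const unsigned b0 = static_cast<unsigned char>(s[0]);
    const unsigned b1 = static_cast<unsigned char>(s[1]);
    return b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF;
}
```

With these helpers the unpaired unit 0xD800 becomes the bytes ED A0 80 and
decodes back to 0xD800, whereas an ordinary code point such as U+20AC encodes
to E2 82 AC and is not flagged.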

It looks horrifying to me, but it may actually be a solution to this problem.

But this should never, ever be used outside Boost.Nowide.

And to be honest - IMHO if a program fails on file names that are encoded in
invalid UTF-16 when Windows states that the encoding is UTF-16... then I think
it should fail.

> You should also keep in mind that Unicode strings can have multiple
> representations even if using strict UTF-8. So one could argue that using
> strict UTF-8 provides a false sense of security.
>

This isn't correct - you are mixing up normalization forms and code point
representation. Yes, properly localized software should generally use
normalized strings.

However, a sequence of valid code points has one and only one representation
in both UTF-8 and UTF-16.
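
A small sketch of the distinction, using hypothetical helper names: the
precomposed U+00E9 and the decomposed pair U+0065 U+0301 render the same, but
they are different code point sequences, and each one maps to exactly one
UTF-8 byte string.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode scalar value as UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_code_point(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out; // not a scalar value
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encodes a whole sequence of code points by concatenating the
// per-code-point encodings.
std::string encode_sequence(const std::u32string& s)
{
    std::string out;
    for (char32_t cp : s) out += encode_code_point(cp);
    return out;
}
```

Here the sequence containing only U+00E9 encodes to the bytes C3 A9, while
U+0065 U+0301 encodes to 65 CC 81. The two byte strings differ because the
code point sequences differ, which is a normalization question and not an
ambiguity in UTF-8 itself.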

There is no such thing as strict UTF-8 - a byte sequence is either UTF-8 or it is not.

Interesting note: on Mac OS X (HFS+) file names are required to be
normalized UTF-8 strings, stored in a decomposed form (a variant of NFD).

Artyom

