
Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Andrey Semashev (andrey.semashev_at_[hidden])
Date: 2015-10-09 11:59:10


On 09.10.2015 18:20, Peter Dimov wrote:
> Andrey Semashev wrote:
>
>> WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with
>> them should be the user's explicit choice (e.g. the user should write
>> utf16_to_wtf8 instead of utf16_to_utf8).
>
> The user doesn't write such things in practice. He writes things like
>
> string fn = get_file_name();
> fopen( fn.c_str() );
>
> and get_file_name and fopen must decide how to encode/decode UTF-8. So
> get_file_name gets some wchar_t[] sequence from Windows, which happens
> to be invalid UTF-16. But Windows doesn't care for UTF-16 validity and
> if you pass this same sequence to it, it will be able to open the file.
> So your choice is whether you make this work, or make this fail. I
> choose to make it work.

What I'm saying is that the get_file_name implementation should not even
spell UTF-8 anywhere, because the encoding it has to deal with is not UTF-8.
Whatever the original encoding of the file name is (broken UTF-16
obtained from WinAPI, or true UTF-8 obtained from the network or a file),
the target encoding has to match what fopen expects. As I remember, on
Windows that is usually not UTF-8 anyway but something like CP1251. But
even if you have a UTF-8 (in Windows terms) locale, the Windows code
conversion algorithm, AFAIU, actually implements something different so
that it can handle invalid UTF-16. Your code should spell out that
'something different' and not UTF-8. If it spells UTF-8, then it should
fail on invalid code sequences.
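
To illustrate the naming point, here is a minimal sketch of a conversion
that spells out what it actually does: lone UTF-16 surrogates get encoded
WTF-8 style instead of being passed off as UTF-8. The name utf16_to_wtf8
and the whole function are hypothetical, not Boost.Nowide's API.

#include <cstdint>
#include <string>

// Hypothetical sketch: convert UTF-16 to WTF-8, i.e. UTF-8 generalized to
// also encode lone surrogates. Not actual Boost.Nowide code.
std::string utf16_to_wtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        std::uint32_t cp = in[i];
        // Combine a proper high/low surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
        {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        // A lone surrogate falls through unchanged and is emitted below as a
        // 3-byte sequence; that is exactly what makes the result WTF-8
        // rather than UTF-8.
        if (cp < 0x80)
            out += static_cast<char>(cp);
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}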

> The functions would of course never produce invalid UTF-8 when passed a
> valid input (and will deterministically produce the least-invalid UTF-8
> for a given input)

There should be no such thing as 'least invalid' or 'almost valid' data.
It's either valid or not. The tool should not produce invalid data,
period. If you want to successfully convert invalid UTF-16 input to a
multibyte encoding, then choose that encoding and don't pretend it's
UTF-8, because UTF-8 cannot represent that input data.
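
For the strict direction, a sketch of the "fail instead of producing
least-invalid output" policy could look like the helper below; the name
is made up for illustration and is not part of any existing library:

#include <cstddef>
#include <string>

// Hypothetical helper: a UTF-16 sequence is valid only if every surrogate
// code unit is part of a proper high/low pair. A conversion that claims to
// produce UTF-8 would check this (or the equivalent inline) and report an
// error instead of emitting bytes that are not UTF-8.
bool is_valid_utf16(const std::u16string& in)
{
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        const char16_t c = in[i];
        if (c >= 0xD800 && c <= 0xDBFF)
        {
            // A high surrogate must be immediately followed by a low one.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                return false;
            ++i;
        }
        else if (c >= 0xDC00 && c <= 0xDFFF)
        {
            // A low surrogate with no preceding high surrogate is invalid.
            return false;
        }
    }
    return true;
}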

> but here again the definition of valid may change
> with time if, f.ex. more code points are added to Unicode beyond the
> current limit.

Unicode versioning is another issue. If it comes to this, we will decide
what to do. We may well decide to go with utf8v2 in the naming, if the
need for strict v1 conformance is strong enough in some cases.

> You should also keep in mind that Unicode strings can have multiple
> representations even if using strict UTF-8. So one could argue that
> using strict UTF-8 provides a false sense of security.

There are normalization and string collation algorithms to deal with
this. What's important is that the input to these and other algorithms
is valid. Otherwise all bets are off.
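
As a concrete illustration of the multiple-representations point: both
byte strings below are strictly valid UTF-8 and display as the same text,
yet they compare unequal bytewise. Resolving that is exactly what
normalization (e.g. to NFC, using a library such as ICU) is for, before
any comparison or collation. The snippet is only an illustration, not
part of any proposed interface.

#include <cassert>
#include <string>

int main()
{
    // "e with acute" precomposed: U+00E9, encoded as the two bytes C3 A9.
    const std::string nfc = "\xC3\xA9";
    // The same text decomposed: U+0065 U+0301 (e + combining acute), bytes 65 CC 81.
    const std::string nfd = "e\xCC\x81";

    // Both are valid UTF-8 for the same visible text, but the byte sequences
    // differ, so a normalization step is needed before comparing them.
    assert(nfc != nfd);
    return 0;
}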
