Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Peter Dimov (lists_at_[hidden])
Date: 2015-10-09 11:20:13


Andrey Semashev wrote:

> WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them
> should be the user's explicit choice (e.g. the user should write
> utf16_to_wtf8 instead of utf16_to_utf8).

The user doesn't write such things in practice. He writes things like

    string fn = get_file_name();
    fopen( fn.c_str(), "rb" );

and get_file_name and fopen must decide how to encode/decode UTF-8. So
get_file_name gets some wchar_t[] sequence from Windows, which happens to be
invalid UTF-16. But Windows doesn't care for UTF-16 validity and if you pass
this same sequence to it, it will be able to open the file. So your choice
is whether you make this work, or make this fail. I choose to make it work.

The functions would of course never produce invalid UTF-8 when passed
valid input (and would deterministically produce the least-invalid UTF-8
for any given input), but here again the definition of valid may change with
time if, for example, more code points are added to Unicode beyond the
current limit.
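
A minimal sketch of what such a lenient conversion could look like, for illustration only. The names append_utf8, utf16_to_utf8_lenient and utf8_to_utf16_lenient are hypothetical and are not the Boost.Nowide API. The assumption is that well-formed surrogate pairs become ordinary 4-byte UTF-8, while an unpaired surrogate is kept as a 3-byte sequence (the WTF-8 approach) rather than rejected, so the original wchar_t sequence can be recovered exactly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper names; this is a sketch, not the Boost.Nowide API.

// Append one code point (possibly a lone surrogate) as 1 to 4 bytes.
inline void append_utf8(std::string& out, std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

inline bool is_high(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_low(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Valid input yields strict UTF-8; unpaired surrogates are preserved as
// 3-byte sequences instead of causing a failure.
inline std::string utf16_to_utf8_lenient(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (is_high(in[i]) && i + 1 < in.size() && is_low(in[i + 1])) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        append_utf8(out, cp);
    }
    return out;
}

// Inverse conversion. Assumes the input came from utf16_to_utf8_lenient;
// malformed bytes are not validated in this sketch.
inline std::u16string utf8_to_utf16_lenient(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
        if (i + extra >= in.size()) break;  // truncated sequence: stop
        std::uint32_t cp = extra == 0 ? b : b & (0x3F >> extra);
        for (std::size_t k = 1; k <= extra; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += extra + 1;
        if (cp >= 0x10000) {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}
```

With this sketch, a file name containing a lone 0xD800 survives the trip through a narrow string and can be handed back to Windows unchanged, whereas a strict converter would have to fail or substitute a replacement character.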

You should also keep in mind that Unicode strings can have multiple
representations even if using strict UTF-8. So one could argue that using
strict UTF-8 provides a false sense of security.
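
One likely reading of "multiple representations" is Unicode canonical equivalence: the same visible character can be written precomposed or as a base letter plus a combining mark. The sketch below (function names are hypothetical) shows two strictly valid UTF-8 encodings of "é" that differ byte for byte.

```cpp
#include <cassert>
#include <string>

// "é" as a single precomposed code point, U+00E9.
inline std::string e_acute_precomposed() { return "\xC3\xA9"; }

// "é" as 'e' (U+0065) followed by the combining acute accent U+0301.
inline std::string e_acute_decomposed() { return "e\xCC\x81"; }

// Plain byte-wise comparison, with no Unicode normalization applied.
inline bool same_bytes(const std::string& a, const std::string& b)
{
    return a == b;
}

// Both strings are valid UTF-8 and render the same, yet they compare unequal.
inline void demo_equivalent_but_different()
{
    assert(!same_bytes(e_acute_precomposed(), e_acute_decomposed()));
}
```

Validating for strict UTF-8 accepts both strings, so it does nothing to make such names compare equal; that requires a separate normalization step.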

