Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Peter Dimov (lists_at_[hidden])
Date: 2015-10-09 12:27:47


Andrey Semashev wrote:
> > string fn = get_file_name();
> > fopen( fn.c_str() );
>
> What I'm saying is that get_file_name implementation should not even spell
> UTF-8 anywhere, as the encoding it has to deal with is not UTF-8. Whatever
> the original encoding of the file name is (broken UTF-16, obtained from
> WinAPI, true UTF-8 obtained from network or a file), the target encoding
> has to match what fopen expects.

'fopen' here is a function that decodes 'fn' and calls _wfopen, or
CreateFileW, or whatever is appropriate.

get_file_name and fopen work in tandem to make it so that the file selected
by the first function is opened by the latter. And to do that, they may need
to put invalid UTF-8 in 'fn'.
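
As a rough illustration of why 'fn' may have to hold invalid UTF-8, here is a minimal sketch of such a pair of conversions. The names encode_name, decode_name and append_generalized_utf8 are hypothetical and are not taken from Boost.Nowide or any other library. The sketch assumes the file name arrives as UTF-16 code units that may contain an unpaired surrogate, and that the decoder only ever sees bytes produced by the matching encoder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration; not part of Boost.Nowide or any real API.

// Appends one code point using the UTF-8 bit layout. Surrogates
// (U+D800..U+DFFF) are deliberately not rejected, so an unpaired surrogate
// becomes three bytes that a strict UTF-8 validator would refuse.
void append_generalized_utf8(std::string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// The get_file_name side: possibly broken UTF-16 in, narrow string out.
// Proper surrogate pairs are combined; lone surrogates pass through as-is.
std::string encode_name(const std::u16string& w)
{
    std::string out;
    for (std::size_t i = 0; i < w.size(); ++i) {
        char32_t cp = w[i];
        bool high = cp >= 0xD800 && cp <= 0xDBFF;
        if (high && i + 1 < w.size() && w[i + 1] >= 0xDC00 && w[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (w[i + 1] - 0xDC00);
            ++i;
        }
        append_generalized_utf8(out, cp);
    }
    return out;
}

// The 'fopen' side: recovers the original UTF-16 code units.
// It trusts continuation bytes, since it only decodes encode_name output.
std::u16string decode_name(const std::string& s)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = 0xFFFD;   // used for stray or invalid lead bytes
        std::size_t len = 1;
        if (b < 0x80)                { cp = b; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        if (i + len > s.size()) { out += char16_t(0xFFFD); break; }
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        if (cp > 0x10FFFF) cp = 0xFFFD;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
        i += len;
    }
    return out;
}
```

With these two functions, a name containing a lone 0xD800 unit survives the trip through 'fn' unchanged, even though the intermediate bytes (ED A0 80) are not valid UTF-8. On Windows, the decoded code units are what would then be handed to _wfopen or CreateFileW.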

> There should be no such thing as 'least invalid' or 'almost valid' data.

There exists a legitimate notion of more valid or less valid UTF-8 because
it can be invalid in different ways, some more basic than others.

> There are normalization and string collation algorithms to deal with this.
> What's important is that the input to these and other algorithms is valid.

This depends on the notion of valid. UTF-8 that encodes codepoints in more
bytes than necessary corresponds to a valid codepoint sequence. Strict
handling rejects it not because it's invalid Unicode, but because it's not
the canonical representation of the codepoint sequence. But the codepoint
sequence itself can be non-canonical, and hence code that assumes that
"validated" UTF-8 is canonical is wrong.
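
Both halves of that argument can be seen in a small sketch. The Decoded struct and decode_utf8_lenient function below are hypothetical names made up for this illustration, and the decoder is not a full validator (it does not check for surrogates or values above U+10FFFF). It only separates "malformed" from "decodable but not in shortest form".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical types and functions for illustration only.
struct Decoded {
    std::u32string code_points;
    bool well_formed = true;    // false on bad lead, bad continuation or truncation
    bool shortest_form = true;  // false if any sequence used more bytes than needed
};

Decoded decode_utf8_lenient(const std::string& s)
{
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    Decoded d;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (b < 0x80)                { cp = b; len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else { d.well_formed = false; return d; }
        if (i + len > s.size()) { d.well_formed = false; return d; }
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) { d.well_formed = false; return d; }
            cp = (cp << 6) | (c & 0x3F);
        }
        assert(len >= 1 && len <= 4);
        if (cp < min_for_len[len]) d.shortest_form = false;  // overlong encoding
        d.code_points += cp;
        i += len;
    }
    return d;
}
```

Under these assumptions, the overlong bytes C0 AF decode to U+002F ('/') with shortest_form set to false, which is the case strict handling rejects. The byte strings C3 A9 (U+00E9) and 65 CC 81 (U+0065 U+0301) both pass the shortest-form check, yet they yield different codepoint sequences for the same canonically equivalent text, so a byte-wise comparison of "validated" UTF-8 still treats them as different.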

The policy of strict UTF-8 is not a bad idea in general, but it's merely a
first line of defense as far as security is concerned. Properly written code
should not need it.

