Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Peter Dimov (lists_at_[hidden])
Date: 2015-10-09 15:15:38


Andrey Semashev wrote:
> On 09.10.2015 19:27, Peter Dimov wrote:
> >> > string fn = get_file_name();
> >> > fopen( fn.c_str() );
...
> > get_file_name and fopen work in tandem to make it so that the file
> > selected by the first function is opened by the latter. And to do that,
> > they may need to put invalid UTF-8 in 'fn'.
>
> Right. Just don't call it UTF-8 anymore.

I don't know what this means.

> > There exists a legitimate notion of more valid or less valid UTF-8
> > because it can be invalid in different ways, some more basic than
> > others.
>
> Could you point me to a definition of these degrees of validity?

First, you can have invalid multibyte sequences in the input.
Second, you can have overlong byte sequences.
Third, the encoded codepoint sequence may be invalid, in various ways.
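To make the three levels concrete, here is a minimal sketch of a checker that reports the first kind of problem it finds. The names `Utf8Status` and `classify_utf8` are hypothetical and not part of any Boost library. For the third level it assumes that "invalid" means only surrogates and values above U+10FFFF; other kinds of codepoint-sequence invalidity are not checked.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical names, for illustration only.
enum class Utf8Status { Ok, MalformedSequence, OverlongEncoding, InvalidCodepoint };

// Returns the first problem found in `s`, testing the three levels in order.
Utf8Status classify_utf8(const std::string& s)
{
    // Smallest codepoint that legitimately needs 1, 2, 3 or 4 bytes.
    static const std::uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };

    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;

        // Level 1: the byte structure itself must be well-formed.
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return Utf8Status::MalformedSequence;   // stray continuation byte or 0xF8..0xFF

        if (i + len > s.size()) return Utf8Status::MalformedSequence;   // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return Utf8Status::MalformedSequence;
            cp = (cp << 6) | (b & 0x3F);
        }

        // Level 2: well-formed bytes, but more of them than the codepoint needs.
        if (cp < min_cp[len]) return Utf8Status::OverlongEncoding;

        // Level 3 (assumption: surrogates and out-of-range values only).
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return Utf8Status::InvalidCodepoint;

        i += len;
    }
    return Utf8Status::Ok;
}
```

With this sketch, a lone `0x80` byte is reported as malformed, `0xC0 0xAF` as overlong, and `0xED 0xA0 0x80` (an encoded surrogate) as an invalid codepoint.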

> > This depends on the notion of valid. UTF-8 that encodes codepoints in
> > more bytes than necessary corresponds to a valid codepoint sequence.
>
> AFAIU, no, it is not a valid encoding.

It's an invalid UTF-8 encoding of a valid codepoint sequence.
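As an illustration of that distinction, the sketch below is a deliberately lenient decoder that checks only the byte structure and does not reject overlong forms. `decode_lenient` is a hypothetical name, not an existing API. Under that assumption, the overlong bytes `0xC0 0xAF` and the single byte `0x2F` decode to the same codepoint, U+002F.

```cpp
#include <cstddef>
#include <string>

// Hypothetical lenient decoder: validates lead and continuation bytes only.
// Overlong forms are accepted on purpose. Returns false on malformed input.
bool decode_lenient(const std::string& s, std::u32string& out)
{
    out.clear();
    for (std::size_t i = 0; i < s.size(); ) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;

        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                      // not a valid lead byte

        if (i + len > s.size()) return false;   // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (b & 0x3F);
        }

        out.push_back(cp);
        i += len;
    }
    return true;
}
```

A strict decoder would reject the two-byte form; the lenient one shows that the codepoint sequence it stands for is itself unremarkable.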

> You mean all string-related code should be prepared for invalid input?

I don't understand this, either.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk