Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Andrey Semashev (andrey.semashev_at_[hidden])
Date: 2015-10-09 15:39:11


On 09.10.2015 22:15, Peter Dimov wrote:
> Andrey Semashev wrote:
>> On 09.10.2015 19:27, Peter Dimov wrote:
>> >> > string fn = get_file_name();
>> >> > fopen( fn.c_str() );
> ...
>> > get_file_name and fopen work in tandem to make it so that the file
>> > selected by the first function is opened by the latter. And to do that,
>> > they may need to put invalid UTF-8 in 'fn'.
>>
>> Right. Just don't call it UTF-8 anymore.
>
> I don't know what this means.

I mean that, as a result, you will have a string fn whose encoding is
not UTF-8. As a consequence, algorithms that require UTF-8 input cannot
be expected to work with this string.
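
To illustrate, here is a minimal sketch of an algorithm that silently
relies on valid UTF-8. The function name count_code_points is
hypothetical and not taken from any library discussed in this thread.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points under the assumption that the
// input is valid UTF-8. Every byte that is not a continuation byte
// (bit pattern 10xxxxxx) is taken to start a new code point.
std::size_t count_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For valid input the count is meaningful: "\xE2\x82\xAC" (U+20AC) yields
1. For a string made only of stray continuation bytes, such as
"\x80\x80", it yields 0 even though the string is not empty, so the
result no longer describes anything useful.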

>> > There exists a legitimate notion of more valid or less valid UTF-8
>> > because it can be invalid in different ways, some more basic than
>> > others.
>>
>> Could you point me to a definition of these degrees of validity?
>
> First, you can have invalid multibyte sequences in the input.
> Second, you can have overlong byte sequences.
> Third, the encoded codepoint sequence may be invalid, in various ways.

Ok, all these count as just invalid to me.
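
For reference, the three cases listed above can be told apart with a
check along these lines. This is only a sketch; Utf8Status and
check_utf8 are hypothetical names, and the order in which the cases are
reported is my assumption, not something defined by a library.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical classification of the first problem found in a string.
enum class Utf8Status {
    Valid,
    MalformedSequence,  // bad lead byte, or missing/bad continuation byte
    Overlong,           // code point encoded in more bytes than necessary
    InvalidCodePoint    // surrogate (U+D800..U+DFFF) or above U+10FFFF
};

Utf8Status check_utf8(const std::string& s) {
    // Smallest code point that legitimately needs 2, 3 or 4 bytes.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return Utf8Status::MalformedSequence;  // stray continuation or 0xF8..0xFF

        if (n - i < len) return Utf8Status::MalformedSequence;  // truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return Utf8Status::MalformedSequence;
            cp = (cp << 6) | (b & 0x3F);
        }

        if (len > 1 && cp < min_cp[len]) return Utf8Status::Overlong;
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return Utf8Status::InvalidCodePoint;
        i += len;
    }
    return Utf8Status::Valid;
}
```

A caller that only cares about valid versus invalid would compare the
result against Utf8Status::Valid and treat the other three values
alike.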

>> > This depends on the notion of valid. UTF-8 that encodes codepoints in
>> > more bytes than necessary corresponds to a valid codepoint sequence.
>>
>> AFAIU, no, it is not a valid encoding.
>
> It's an invalid UTF-8 encoding of a valid codepoint sequence.

Yes, but a valid codepoint sequence is not enough to interpret the string.

>> You mean all string-related code should be prepared for invalid input?
>
> I don't understand this, either.

You said that properly written code should not require string validity.
Should such code always be prepared for invalid strings, at any point?
If so, this looks like unnecessary overhead to me.

