Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Peter Dimov (lists_at_[hidden])
Date: 2015-10-09 16:25:59


Andrey Semashev wrote:
> >> Right. Just don't call it UTF-8 anymore.
> >
> > I don't know what this means.
>
> I mean as a result you will have a string fn, whose encoding is not UTF-8.
> As a consequence algorithms that require UTF-8 input cannot be expected to
> work with this string.

It's invalid UTF-8 and yes, algorithms that require valid UTF-8 will
obviously not work with it.

The point is that the implementation of these functions needs to
encode/decode this not-quite-valid-UTF-8, for which it needs functions that
encode/decode this not-quite-valid-UTF-8.

> > It's an invalid UTF-8 encoding of a valid codepoint sequence.
>
> Yes, but valid codepoint sequence is not enough to interpret the string.

It's enough. What more would you need?

> >> You mean all string-related code should be prepared for invalid input?
> >
> > I don't understand this, either.
>
> You said that properly written code should not require string validity.
> Should such code be always prepared for invalid strings, at any point? If
> so, this looks like unnecessary overhead to me.

I said that properly written code should not require minimal UTF-8 byte
sequences, because properly written code validates the codepoint sequence
(after normalizing it, if required), not the UTF-8 byte sequence.

To expand on that, UTF-8 overlong sequences are a source of security
issues because of code that does

external input -> validate as NTBS -> ... -> pass to UTF-8 API ->
decoding -> do something

because if validation is supposed to reject ../passwords.txt, the attacker
encodes the dots as two bytes and gets around the naive NTBS validation
which no longer sees '..' but something else.
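
As a minimal sketch of that bypass: naive_has_dotdot below is a hypothetical
stand-in for the byte-level NTBS validation, not code from any real library.
The bytes 0xC0 0xAE are the overlong two-byte form of U+002E ('.').

```cpp
#include <string>

// Hypothetical naive check: searches the raw bytes for "..".
bool naive_has_dotdot(const std::string& bytes)
{
    return bytes.find("..") != std::string::npos;
}

// The same path twice: once with plain dots, once with each dot
// encoded as the overlong pair 0xC0 0xAE.
const std::string plain    = "../passwords.txt";
const std::string overlong = "\xC0\xAE\xC0\xAE/passwords.txt";

// naive_has_dotdot(plain) finds the dots; naive_has_dotdot(overlong) does
// not, although a lenient decoder further down the pipeline would turn both
// into the same codepoint sequence.
```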

But the actual problem with this code is that the validation should be done
on the codepoint sequence, not on the byte sequence. And if you do that, you
see the dot as a dot (and the slash as a slash and the NUL as a NUL)
regardless of whether it's encoded with one byte or four.
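
A rough sketch of validating on the codepoint sequence instead, assuming a
lenient decoder that accepts overlong forms. Both decode_lenient and
has_dotdot are hypothetical names made up for illustration; the decoder
deliberately skips surrogate and range checks to stay short.

```cpp
#include <cstddef>
#include <string>

// Hypothetical lenient decoder: accepts overlong forms, returns false on
// truncated or malformed input. No surrogate or U+10FFFF range checks.
bool decode_lenient(const std::string& in, std::u32string& out)
{
    out.clear();
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i++]);
        char32_t cp;
        int extra; // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return false; // stray continuation or invalid lead byte
        for (; extra > 0; --extra) {
            if (i == in.size()) return false; // truncated sequence
            unsigned char c = static_cast<unsigned char>(in[i++]);
            if ((c & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
    }
    return true;
}

// Validation on codepoints: a dot is a dot however it was encoded.
bool has_dotdot(const std::u32string& cps)
{
    return cps.find(U"..") != std::u32string::npos;
}
```

With this, "\xC0\xAE\xC0\xAE/passwords.txt" decodes to the same codepoints as
"../passwords.txt", so the check rejects both.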

Anyway, that was a detour. In practice I can't think of valid cases for
accepting overlong sequences except the overlong zero (NUL encoded as
0xC0 0x80), and maybe not even then.

