From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2020-01-08 01:24:16
On Tue, Jan 7, 2020 at 5:16 PM Peter Dimov via Boost <boost_at_[hidden]> wrote:
> Gavin Lambert wrote:
> > But the conversion from WTF-8 to UTF-16 can interpret the joining point
> > as a different character, resulting in a different sequence. Unless I've
> > misread something, this could occur if the first string ended in an
> > unpaired high surrogate and the second started with an unpaired low
> > surrogate (or rather the WTF-8 equivalents thereof).
> I don't see why you think this would present a problem. The conversion of
> the first string will end in an unpaired high surrogate. The conversion of
> the second string will start with an unpaired low surrogate. The two, when
> concatenated, will form a valid UTF-16 encoding of a non-BMP character.
> Where is the issue here?
That's essentially my point. However, Gavin is referring to the fact that
the current WTF-8 spec explicitly says that a high/low surrogate pair
encoded as two separate three-byte sequences is invalid in WTF-8.
UTF-16: d83d de09
should be encoded as
WTF-8: f0 9f 98 89
But if one "UTF-16" string ended in d83d and the other started with de09,
concatenating their WTF-8 encodings would yield
"Invalid WTF-8": ed a0 bd ed b8 89
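
The two byte sequences can be reproduced with a minimal sketch. The names
append_code_point and to_wtf8 are hypothetical helpers written for this
illustration; they are not part of Boost or of the WTF-8 spec. The sketch
assumes surrogates are paired only when both halves are in the same input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append the generalized UTF-8 encoding of one code point.
// Lone surrogates (U+D800..U+DFFF) are deliberately allowed here.
void append_code_point(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: potentially ill-formed UTF-16 -> WTF-8.
// A high surrogate followed by a low surrogate is combined into one
// supplementary code point; any other surrogate is encoded on its own.
std::string to_wtf8(const std::vector<std::uint16_t>& units) {
    std::string out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        std::uint32_t cp = units[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        if (high && i + 1 < units.size() &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        }
        append_code_point(out, cp);
    }
    return out;
}
```

With these helpers, to_wtf8({0xD83D, 0xDE09}) gives f0 9f 98 89, whereas
to_wtf8({0xD83D}) + to_wtf8({0xDE09}) gives ed a0 bd ed b8 89.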
The spec explicitly prohibits this six-byte sequence. The rationale is to
have a unique representation of any "UTF-16" stream, just as UTF-8 requires
shortest-form representations. This might matter for security if you are
going to compare those "invalid WTF-8" strings byte by byte, but it is not
an issue if the next thing you do is convert them back to UTF-16.
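
A minimal sketch of that round trip, assuming a lenient decoder that does
no validation. The name to_utf16 is a hypothetical helper for this
illustration, not an existing Boost or standard-library function.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode generalized UTF-8 (including encoded lone
// surrogates) into UTF-16 code units. Assumes structurally valid input;
// a real decoder would have to reject malformed sequences.
std::vector<std::uint16_t> to_utf16(const std::string& s) {
    std::vector<std::uint16_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        const std::size_t len =
            lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Payload bits of the lead byte: 7, 5, 4 or 3 of them.
        std::uint32_t cp = len == 1 ? lead
                         : len == 2 ? (lead & 0x1F)
                         : len == 3 ? (lead & 0x0F)
                                    : (lead & 0x07);
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {
            // Supplementary code point: split into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            // BMP code point or lone surrogate: emitted unchanged.
            out.push_back(static_cast<std::uint16_t>(cp));
        }
    }
    return out;
}
```

Under these assumptions, f0 9f 98 89 and ed a0 bd ed b8 89 differ as byte
strings, yet to_utf16 maps both to the code units d83d de09.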
-- Yakov Galka http://stannum.co.il/