From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2020-01-08 01:24:16


On Tue, Jan 7, 2020 at 5:16 PM Peter Dimov via Boost <boost_at_[hidden]>
wrote:

> Gavin Lambert wrote:
> > But the conversion from WTF-8 to UTF-16 can interpret the joining point
> > as a different character, resulting in a different sequence. Unless I've
> > misread something, this could occur if the first string ended in an
> > unpaired high surrogate and the second started with an unpaired low
> > surrogate (or rather the WTF-8 equivalents thereof).
>
> I don't see why you think this would present a problem. The conversion of
> the first string will end in an unpaired high surrogate. The conversion of
> the second string will start with an unpaired low surrogate. The two, when
> concatenated, will form a valid UTF-16 encoding of a non-BMP character.
> Where is the issue here?
>

That's essentially my point. However, Gavin is referring to the fact that the
current WTF-8 spec explicitly says that a high surrogate followed by a low
surrogate, each encoded as its own three-byte sequence, is invalid in WTF-8.

For example

UTF-16: d83d de09

should be encoded as

WTF-8: f0 9f 98 89

But if one "UTF-16" string ended in d83d and the other started with de09,
converting each to WTF-8 separately and concatenating the results would yield

"Invalid WTF-8": ed a0 bd ed b8 89
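
To make the byte values above reproducible, here is a minimal sketch of such an encoder. The names append_code_point and to_wtf8 are made up for this illustration and are not taken from the WTF-8 spec or from any Boost library. A proper surrogate pair is combined into one code point, and an unpaired surrogate is written as a three-byte sequence.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: append one code point in generalized UTF-8
// (surrogate code points are allowed and take three bytes).
inline void append_code_point(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode possibly ill-formed UTF-16 as WTF-8.
inline std::string to_wtf8(const std::vector<std::uint16_t>& units) {
    std::string out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        std::uint32_t cp = units[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        if (high && i + 1 < units.size() &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            // Proper pair: combine into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
            ++i;
        }
        append_code_point(out, cp);
    }
    return out;
}
```

With this sketch, to_wtf8({0xD83D, 0xDE09}) gives f0 9f 98 89, while to_wtf8({0xD83D}) + to_wtf8({0xDE09}) gives ed a0 bd ed b8 89.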

The spec explicitly prohibits this. The rationale is to have a unique
representation of any "UTF-16" stream, just as UTF-8 requires the shortest
representation of each code point. That uniqueness may matter for security
if you are going to compare those "invalid WTF-8" strings byte by byte, but
it is not an issue if the next thing you do is convert them back to UTF-16.
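
The round trip can be sketched as follows. The function name wtf8_to_utf16 is hypothetical, and the decoder assumes its input is structurally well formed (it does no validation, which a real implementation would need). It only shows that the four-byte form and the six-byte concatenated form decode to the same UTF-16 code units.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode generalized UTF-8 (surrogates allowed)
// into UTF-16 code units. No validation is performed.
inline std::vector<std::uint16_t> wtf8_to_utf16(const std::string& in) {
    std::vector<std::uint16_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b0 < 0x80)      { cp = b0;        len = 1; }
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
        else                { cp = b0 & 0x07; len = 4; }
        // Fold in the six payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {
            // Supplementary code point: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            // BMP code point or lone surrogate: emit as is.
            out.push_back(static_cast<std::uint16_t>(cp));
        }
    }
    return out;
}
```

Both wtf8_to_utf16("\xF0\x9F\x98\x89") and wtf8_to_utf16("\xED\xA0\xBD\xED\xB8\x89") produce d83d de09, so the difference between the two byte strings disappears after conversion; it is visible only when the WTF-8 bytes themselves are compared.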

-- 
Yakov Galka
http://stannum.co.il/

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk