From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2020-01-07 23:57:34


On Tue, Jan 7, 2020 at 3:17 PM Gavin Lambert via Boost
<boost_at_[hidden]> wrote:

> > See above, malformed UTF-16 can be converted to WTF-8 (a UTF-8 superset)
> > and back losslessly. The unprecedented introduction of a
> > platform-specific interface into the standard was, still is, and will
> > always be, a horrible mistake.
>
> Given that WTF-8 is not itself supported by the C++ standard library
> (and the other formats are), that doesn't seem like a valid argument.
> You'd have to campaign for that to be added first.
>

It doesn't need to be added to the standard. My claim was that instead of
adding a wchar_t/char Heisenstring to the standard and proliferating the
number of fstream constructors, one could stick to char interfaces and
demand that "the basic execution character set be capable of storing any
Unicode data". A Windows implementation could satisfy that with WTF-8,
allowing lossless transcoding.
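
For illustration, here is a minimal sketch of that round trip. The names
wtf8_encode/wtf8_decode are made up for this example, and the decoder
skips validation for brevity (it assumes well-formed WTF-8 input):

#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Encode one code point with the generalized UTF-8 scheme used by
// WTF-8, i.e. lone surrogates (U+D800..U+DFFF) are allowed through.
void put_cp(std::string& out, uint32_t cp) {
    if (cp < 0x80) {
        out += char(cp);
    } else if (cp < 0x800) {
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) { // includes surrogates
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
}

// Potentially ill-formed UTF-16 -> WTF-8. A valid surrogate pair
// becomes one supplementary code point; a lone surrogate is encoded
// as itself, which plain UTF-8 would forbid.
std::string wtf8_encode(const std::vector<uint16_t>& u16) {
    std::string out;
    for (size_t i = 0; i < u16.size(); ++i) {
        uint16_t u = u16[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < u16.size() &&
            u16[i + 1] >= 0xDC00 && u16[i + 1] <= 0xDFFF) {
            put_cp(out, 0x10000 + ((uint32_t(u) - 0xD800) << 10) +
                        (u16[i + 1] - 0xDC00));
            ++i; // consumed the low surrogate too
        } else {
            put_cp(out, u);
        }
    }
    return out;
}

// WTF-8 -> UTF-16, the exact inverse (no validation, for brevity).
std::vector<uint16_t> wtf8_decode(const std::string& s) {
    std::vector<uint16_t> out;
    for (size_t i = 0; i < s.size();) {
        unsigned char b = s[i];
        uint32_t cp;
        size_t n;
        if (b < 0x80)      { cp = b;        n = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; n = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; n = 3; }
        else               { cp = b & 0x07; n = 4; }
        for (size_t k = 1; k < n; ++k)
            cp = (cp << 6) | (uint8_t(s[i + k]) & 0x3F);
        if (cp >= 0x10000) {
            out.push_back(uint16_t(0xD800 + ((cp - 0x10000) >> 10)));
            out.push_back(uint16_t(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(uint16_t(cp)); // lone surrogates survive
        }
        i += n;
    }
    return out;
}

int main() {
    // 'a', lone high surrogate, 'b': not valid UTF-16, yet round-trips.
    std::vector<uint16_t> ill_formed = { 0x61, 0xD800, 0x62 };
    assert(wtf8_decode(wtf8_encode(ill_formed)) == ill_formed);
}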

> The main problem though is that once you start allowing transcoding of
> any kind, it's a slippery slope to other conversions that can make lossy
> changes (such as applying different canonicalisation formats, or
> adding/removing layout codepoints such as RTL markers).
>

The truth is that transcoding is already happening: mount a Windows
partition on Unix, or vice versa. Some breakage is expected there if the
filenames contain invalid sequences.

> Also, if you read the WTF-8 spec, it notes that it is not legal to
> directly concatenate two WTF-8 strings (you either have to convert it
> back to UTF-16 first, or execute some special handling for the trailing
> characters of the first string), which immediately renders it a poor
> choice for a path storage format. And indeed a poor choice for any
> purpose. (I suspect many people who are using it have conveniently
> forgotten that part.)
>

Paths are almost always concatenated with ASCII separators (or other
valid strings) in between. Even when malformed strings are concatenated
directly, the issue does not arise if the result is passed straight back
to the "UTF-16" system.

> Although on a related note, I think C++11/17 dropped the ball a bit on
> the new encoding-specific character types. [...]
>

C++11 over-engineered it, and you keep over-engineering it even further. I
can't think of a time anybody has had to mix ASCII, UTF-8, WTF-8, and
EBCDIC strings in one program *at compile time*.

-- 
Yakov Galka
http://stannum.co.il/
