
From: Gavin Lambert (boost_at_[hidden])
Date: 2020-01-07 23:16:52


On 7/01/2020 14:58, Yakov Galka wrote:
>> So, while unfortunate, v3 made the correct choice. Paths have to be
>> kept in their original encoding between original source (command line,
>> file, or UI) and file API usage, otherwise you can get weird errors when
>> transcoding produces a different byte sequence that appears identical
>> when actually rendered, but doesn't match the filesystem. Transcoding
>> is only safe when you're going to do something with the string other
>> than using it in a file API.
>
> See above, malformed UTF-16 can be converted to WTF-8 (a UTF-8 superset)
> and back losslessly. The unprecedented introduction of a platform specific
> interface into the standard was, still is, and will always be, a horrible
> mistake.

Given that WTF-8 is not itself supported by the C++ standard library
(and the other formats are), that doesn't seem like a valid argument.
You'd have to campaign for that to be added first.

The main problem though is that once you start allowing transcoding of
any kind, it's a slippery slope to other conversions that can make lossy
changes (such as applying different canonicalisation formats, or
adding/removing layout codepoints such as RTL markers).

Also, if you read the WTF-8 spec, it notes that it is not legal to
directly concatenate two WTF-8 strings (you either have to convert it
back to UTF-16 first, or execute some special handling for the trailing
characters of the first string), which immediately renders it a poor
choice for a path storage format. And indeed a poor choice for any
purpose. (I suspect many people who are using it have conveniently
forgotten that part.)

Although on a related note, I think C++11/17 dropped the ball a bit on
the new encoding-specific character types. It's definitely an
improvement on the prior method, but it would have been better to do
something like:

     struct ansi_encoding_t;
     struct utf_encoding_t;
     typedef encoded_char<ansi_encoding_t, 8> char_t;
     typedef encoded_char<utf_encoding_t, 8> char8_t;
     typedef encoded_char<utf_encoding_t, 16> char16_t;

Where "encoded_char<E,N>" has storage size equal to N bits (blittable,
and otherwise behaves like a standard integer type) but also carries
around an arbitrary encoding tag type E. This could be used to
distinguish "a string encoded in UTF-8" from "a string encoded in WTF-8"
or "a string encoded in EBCDIC". And supplemental libraries could
define additional encodings and conversion functions, and algorithms
could operate on generic strings of any encoding.
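A minimal sketch of what such an encoded_char might look like (the names and layout here are my own illustration, not part of any standard or proposal): the encoding tag makes strings of different encodings distinct types, so a UTF-8 string cannot be silently passed where, say, a WTF-8 string is expected.

```cpp
#include <cstdint>
#include <type_traits>

// Illustrative encoding tags; libraries could define their own.
struct ansi_encoding_t;
struct utf_encoding_t;
struct wtf_encoding_t;

// Blittable character type: N bits of storage plus a compile-time
// encoding tag E that participates in overload resolution.
template <class E, int Bits>
struct encoded_char {
    using storage_type =
        std::conditional_t<Bits == 8, std::uint_least8_t,
        std::conditional_t<Bits == 16, std::uint_least16_t,
                                       std::uint_least32_t>>;
    storage_type value{};

    constexpr encoded_char() = default;
    constexpr explicit encoded_char(storage_type v) : value(v) {}
    friend constexpr bool operator==(encoded_char a, encoded_char b) {
        return a.value == b.value;
    }
};

// Same storage size, different static types:
using utf8_char = encoded_char<utf_encoding_t, 8>;
using wtf8_char = encoded_char<wtf_encoding_t, 8>;

static_assert(!std::is_same_v<utf8_char, wtf8_char>,
              "encodings are distinct at the type level");
```

Conversion functions between encodings would then be ordinary overloads on the tag types, and generic algorithms could be templated on the encoding.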


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk