From: Gavin Lambert (boost_at_[hidden])
Date: 2020-01-08 01:07:57


On 8/01/2020 12:57, Yakov Galka wrote:
> Paths are, almost always, concatenated with ASCII separators (or other
> valid strings) in-between. Even when concatenating malformed strings
> directly, the issue isn't there if the result is passed immediately back to
> the "UTF-16" system.

But the conversion from WTF-8 to UTF-16 can interpret the joining point
as a different character, resulting in a different sequence. Unless
I've misread something, this could occur if the first string ended in an
unpaired high surrogate and the second started with an unpaired low
surrogate (or rather the WTF-8 equivalents thereof). Unlikely, perhaps,
but not impossible.
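
For concreteness, here's a minimal sketch of that join-point hazard in
plain C++. The byte values follow directly from how WTF-8 encodes lone
surrogates (as 3-byte generalized-UTF-8 sequences):

    #include <cstdio>
    #include <string>

    int main() {
        std::string a = "\xED\xA0\x80"; // lone high surrogate U+D800
        std::string b = "\xED\xB0\x80"; // lone low surrogate  U+DC00

        std::string joined = a + b;     // bytes: ED A0 80 ED B0 80

        // Well-formed WTF-8 must encode the pair <D800, DC00> as the
        // 4-byte sequence F0 90 80 80 (i.e. U+10000), so `joined` is
        // no longer valid WTF-8.  A lenient WTF-8 -> UTF-16 converter
        // produces the units D800 DC00 -- one character, U+10000 --
        // instead of the two lone surrogates the halves originally
        // held, while a strict converter rejects the string outright.
        for (unsigned char c : joined)
            std::printf("%02X ", c);
        std::printf("\n");
    }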

>> Although on a related note, I think C++11/17 dropped the ball a bit on
>> the new encoding-specific character types. [...]
>
> C++11 over-engineered it, and you keep over-engineering it even further.
> Can't think of a time anybody had to mix ASCII, UTF-8, WTF-8 and EBCDIC
> strings in one program *at compile time*.

You've just suggested cases where apps will contain both UTF-8 and
WTF-8, which would be useful to distinguish between at compile time --
both to allow overloading to automatically select the correct conversion
function and to give you compile errors if you accidentally try to pass
a WTF-8 string to a function that expects pure UTF-8, or vice versa.
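
To illustrate -- purely a sketch, these tagged types don't exist in
any standard or Boost library -- compile-time separation could look
something like this:

    #include <string>

    // Hypothetical strong-typedef scheme: each encoding gets its own
    // type, so mixing them up is a compile error rather than a bug.
    struct utf8_tag {};
    struct wtf8_tag {};

    template <class Encoding>
    struct tagged_string {
        std::string bytes;  // raw code units
    };

    using utf8_string = tagged_string<utf8_tag>;
    using wtf8_string = tagged_string<wtf8_tag>;

    // Overloading selects the correct conversion automatically
    // (stub bodies here; real ones would validate and convert):
    std::u16string to_utf16(const utf8_string&) { return {}; } // strict
    std::u16string to_utf16(const wtf8_string&) { return {}; } // lenient

    void transmit(const utf8_string&) {} // requires well-formed UTF-8

    void example(const wtf8_string& path) {
        auto native = to_utf16(path); // overload picked at compile time
        // transmit(path);  // would not compile: WTF-8 is not UTF-8
        (void)native;
    }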

The same applies in other cases. That's why C++20 introduced char8_t,
so that you can't accidentally pass UTF-8 strings to functions
expecting other character encodings.
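
For instance, compiled as C++20 (where u8 literals are arrays of
char8_t rather than char):

    #include <cstdio>

    void print(const char* s)    { std::printf("narrow: %s\n", s); }
    void print(const char8_t* s) { (void)s; std::printf("utf-8\n"); }

    int main() {
        print("hello");    // picks the char overload
        print(u8"héllo");  // picks the char8_t overload
        // const char* p = u8"héllo"; // error in C++20: char8_t*
        //                            // does not convert to char*
    }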

This could even be extended to other forms of two-way data encoding,
such as uuencoding or Base64. I don't think that's over-engineering;
that's just basic data conversion and type safety.
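
The same hypothetical tagging idea applied to a transfer encoding
(again just a sketch, with stubbed-out bodies):

    #include <string>
    #include <vector>

    // Base64 text and the raw bytes it encodes get distinct types.
    struct base64_string { std::string text; };

    std::vector<unsigned char>
    decode_base64(const base64_string&) { return {}; }       // stub
    base64_string
    encode_base64(const std::vector<unsigned char>&) { return {}; }

    void example(const base64_string& b64, const std::string& plain) {
        auto raw = decode_base64(b64); // fine: input is known Base64
        // decode_base64(plain);       // would not compile: a plain
        //                             // string is not tagged Base64
        (void)raw; (void)plain;
    }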

