From: Daniela Engert (dani_at_[hidden])
Date: 2022-08-16 06:27:16


On 16.08.2022 at 05:15, Gavin Lambert via Boost wrote:
> On 16/08/2022 11:53, Vinnie Falco wrote:
>> My experiences with std::filesystem and boost::filesystem have been
>> nothing but negative. I think that the decision to make the character
>> type different on Windows was a mistake. The need for locales and
>> imbuements and global state and... really, it is just giving me a big
>> headache.
>
> Using wchar_t on Windows is actually the least painful option. (And
> you don't have to worry about locales and imbuements etc if you never
> try to convert to not-wchar_t.)
>
> For correct behaviour, you *must* only use the W variants of the
> native API methods, or wchar_t methods of standard library functions.
>
> Inevitably, everything in the standard library that accepts 'char'
> params assumes that these are encoded in the ANSI code page, not
> UTF-8. This can't be "fixed" or it breaks all the legacy apps.
>
> In practice, this means that unless you can absolutely guarantee that
> your paths only contain pure ASCII (and the instant you accept a path
> or filename from the user, you lose), it is *never* safe to use any of
> the non-wide library methods.
>
> You *can* (and many do) store paths in other libraries and in the
> application in 'char'-encoded-as-UTF-8, but then you have to remember
> every single time you hit the standard library or direct WinAPI
> boundaries to convert your strings to wide before passing them across,
> or hilarity will ensue (without even a convenient compiler error).
>
> Storing paths as wchar_t in the first place both avoids the cost of
> converting back and forth and potential corruption (often overlooked,
> unless you regularly test with Unicode paths) from accidentally
> forgetting a conversion.
>
>> (where is the signature of fopen that accepts a filesystem::path?)
>
> Why are you using fopen in C++ in the first place?
>
> Filesystem does provide 'path' overloads for fstreams, which you
> should have been using instead anyway.
>
>> It should be utf-8 only, use Plain Old char (even on Windows), it should
>> be completely portable, except that it requires that directories are
>> possible and that the filesystem isn't weird (I don't really care
>> about compatibility with grandpa's EPROMs that can hold 9-bit flat
>> files).
>
> In theory, the standard library (and other wrapper libraries around
> the WinAPI, including Filesystem) could start doing more sane things
> by using the C++20 'char8_t'/'u8string' types to disambiguate between
> UTF-8 encoded paths and legacy idkwtf-'char'-encoded paths.  But this
> will take a very long time to percolate through the ecosystem,
> especially as there are a bunch of people who hate the very idea of
> it.  And it doesn't solve the conversion performance angle.
>
> (Hopefully, Windows will eventually provide char8_t entrypoints and
> APIs, which will make it easier to interoperate with not-Windows.)
>
> Although as Emil has already pointed out, it's valid in not-Windows to
> have arbitrary not-UTF-8 byte sequences in paths, so you can get into
> trouble in that direction as well.
>
> That's another reason for using wchar_t in Windows and char in
> not-Windows: no conversions happen at all (at least where values are
> accepted natively from the OS), which has maximal compatibility for
> otherwise-invalid byte sequences that nevertheless exist.

Amen brother, you speak wisely!

I want to add the following to stay sane on Windows: ensure that *both*
the wide and the narrow execution character encodings are Unicode (i.e.
UTF-16 for wchar_t (that's the default) and UTF-8 for char), build with
_UNICODE defined, and link with an application manifest that sets
<activeCodePage
xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>.
This guarantees consistent semantics throughout the *whole* execution of
the program on reasonably recent versions of Windows. And lastly,
represent paths with std/boost filesystem paths and use APIs that know
how to deal with them *correctly*.
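
To illustrate that last point, here is a minimal sketch, assuming C++17
and std::filesystem. The helper name write_then_read_line is made up for
this example. It only shows the path object being handed directly to the
stream constructors, so no narrow/wide conversion is written by hand:

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not part of any library): writes `text` plus a
// newline to `file`, then reads the first line back and returns it.
// The fs::path is passed straight to the stream constructors (the C++17
// overloads taking a filesystem path), so the native character type is
// used on every platform and .string() is never called.
inline std::string write_then_read_line(const fs::path& file,
                                        const std::string& text)
{
    assert(!file.empty());  // precondition of this sketch only

    {
        std::ofstream out(file, std::ios::binary | std::ios::trunc);
        out << text << '\n';
    }   // leaving the scope flushes and closes the file

    std::ifstream in(file, std::ios::binary);
    std::string line;
    std::getline(in, line);  // reads up to, and discards, the '\n'
    return line;
}
```

A caller would build the path with path operations, for example
std::filesystem::temp_directory_path() / "demo.txt", and keep it as a
path object until it reaches the stream.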

Similar advice applies to POSIX systems. UTF-8 everywhere is only a
recommendation, not a guarantee.

Dani

