Subject: Re: [boost] [review] Review of Nowide (Unicode) starts today
From: Groke, Paul (paul.groke_at_[hidden])
Date: 2017-06-12 18:14:32

Artyom Beilis wrote:
> On Mon, Jun 12, 2017 at 6:05 PM, Vadim Zeitlin via Boost
> <boost_at_[hidden]> wrote:
> > On Mon, 12 Jun 2017 17:58:32 +0300 Artyom Beilis via Boost
> > <boost_at_[hidden]> wrote:
> >
> > AB> By definition: you can't handle file names that can't be
> > AB> represented in UTF-8 as there is no valid UTF-8 representation exist.
> >
> > This is a nice principle to have in theory, but very unfortunate in
> > practice because at least under Unix systems such file names do occur
> > in the wild (maybe less often now than 10 years ago, when UTF-8 was
> > less ubiquitous, but it's still hard to believe that the problem has
> > completely disappeared). And there are ways to solve it, e.g. I think
> > glib represents such file names using special characters from a PUA
> > and there are other possible approaches, even if, admittedly, none of
> > them is perfect.
> >
> Please note: Under POSIX platforms no conversions are performed and no
> UTF-8 validation is done as this is incorrect:

Well... what's correct on POSIX platforms is a matter of opinion. If you go with the strict interpretation, then in fact conversion from the current locale to UTF-8 must be considered incorrect. But then you cannot rely on *anything*, except that 0x00 is NUL and 0x2F is the path separator, which makes any kind of isdigit/toupper/tolower/... string parsing/processing "incorrect".

> The only case is when Windows Wide API returns/creates invalid UTF-16 -
> which can happen only when invalid surrogate
> UTF-16 pairs are generated - and they have no valid UTF-8 representation.
> On the other hand creating deliberately invalid UTF-8 is very problematic
> idea.

Since the UTF-8 conversion is only done on/for Windows, and Windows doesn't guarantee that all wchar_t paths (or strings in general) will always be valid UTF-16, wouldn't it make more sense to just *define* that the library always uses WTF-8, which allows round-tripping of all possible 16-bit strings? If it's documented that way, it shouldn't matter, especially since users of the library cannot rely on the strings being in UTF-8 anyway, at least not in portable applications.
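To make the round-tripping claim concrete, here is a minimal sketch of a WTF-8 encoder/decoder pair. The names to_wtf8 and from_wtf8 are made up for illustration and are not part of Nowide. The decoder assumes its input came from the encoder and does no validation. Well-formed surrogate pairs become ordinary 4-byte sequences, and unpaired surrogates are encoded as 3-byte sequences instead of being rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encode arbitrary 16-bit code units as WTF-8.
std::string to_wtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        // A high surrogate followed by a low surrogate forms one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        // Unpaired surrogates fall through to the 3-byte branch unchanged.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: decode WTF-8 produced by to_wtf8 (no validation).
std::u16string from_wtf8(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        std::uint32_t cp;
        int extra;
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else                            { cp = lead & 0x07; extra = 3; }
        // Fold in the continuation bytes, 6 payload bits each.
        for (; extra > 0 && i < in.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        if (cp >= 0x10000) {
            // Supplementary code point: split back into a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}
```

With these, a string containing a lone 0xD800 survives from_wtf8(to_wtf8(s)) unchanged, whereas a strict UTF-8 converter would have to reject or replace it.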

I agree that the overlong zero/NUL encoding part of modified UTF-8 might still be problematic though, and therefore WTF-8 might be the better choice. Now that still leaves some files that can theoretically exist on a Windows system inaccessible (i.e. those with embedded NUL characters), but those are not accessible via the "usual" Windows APIs either (CreateFileW etc.). So this should be acceptable.
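As a small illustration of why embedded NULs are out of reach for APIs that take NUL-terminated strings (CreateFileW takes its path that way), here is a portable sketch. The helper name visible_length is made up for this example. It reports how many characters such an API would see.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: the number of wide characters a NUL-terminated
// C-style API would read from `path`, regardless of path.size().
std::size_t visible_length(const std::wstring& path) {
    return std::wcslen(path.c_str());
}
```

For a std::wstring built from L"ab\0cd" with an explicit length of 5, size() is 5 but visible_length() is 2, so everything after the first NUL is silently dropped before the API ever looks at it.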
