Subject: Re: [boost] [review] Review of Nowide (Unicode) starts today
From: Groke, Paul (paul.groke_at_[hidden])
Date: 2017-06-12 19:06:20


Zach Laine wrote:
> On Mon, Jun 12, 2017 at 1:14 PM, Groke, Paul via Boost <
> boost_at_[hidden]> wrote:
>
> > Since the UTF-8 conversion is only done on/for Windows, and Windows
> > doesn't guarantee that all wchar_t paths (or strings in general) will
> > always be valid UTF-16, wouldn't it make more sense to just *define*
> > that the library always uses WTF-8, which allows round-tripping of all
> > possible 16 bit strings? If it's documented that way it shouldn't matter.
> > Especially since users of the library cannot rely on the strings being
> > in UTF-8 anyway, at least not in portable applications.
> >
>
> I agree that round-tripping to wchar_t paths on Windows is very important.
> I also agree that not detecting invalid UTF-8, or failing to produce an error in
> such a case, is very important to *avoid*.
>
> Can we get both? Is it possible to add a WTF-8 mode, perhaps only used in
> user-selectable string processing cases?

Well, I don't see why detecting invalid UTF-8 would be important. In my initial mail I (wrongly) assumed that the library would also translate between the native encoding and UTF-8 on non-Windows platforms such as Linux. Since it doesn't, I guess the library can simply pass strings through unchanged on all platforms that have narrow APIs. In fact I think it should: checking for valid UTF-8 would make some files inaccessible on systems like Linux, where you can very easily create file names that aren't valid UTF-8.
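To illustrate the concern about validation, here is a minimal sketch of a strict UTF-8 check. The function name is_valid_utf8 is made up for this example and is not part of Boost.Nowide. Linux accepts any byte sequence without '/' or NUL as a file name, so a name such as "caf\xE9.txt" (Latin-1 encoded) is perfectly legal there, yet a wrapper that validated first would refuse it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Boost.Nowide API): strict UTF-8 validation.
// Rejects stray/missing continuation bytes, overlong forms, surrogates
// and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) { ++i; continue; }           // plain ASCII byte

        std::size_t len;     // total length of this sequence
        std::uint32_t cp;    // code point being assembled
        std::uint32_t min;   // smallest value allowed for this length
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;                          // stray continuation or bad lead byte

        if (i + len > s.size()) return false;       // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) return false;        // overlong or out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;     // surrogate code point
        i += len;
    }
    return true;
}
```

With this check, "caf\xC3\xA9.txt" passes but "caf\xE9.txt" does not, even though both are names an existing file on Linux could have. A pass-through wrapper would open either one.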

In that case the terms "Unicode" and "UTF-8" should not be used in describing the library (name and documentation). The documentation should just say that it transforms strings to some unspecified encoding with the following properties:
- Byte-based
- ASCII compatible (*)
- Self-synchronizing
- Able to 100% round-trip all characters/NUL-terminated strings from the native platform encoding
And for Windows this unspecified encoding then would just happen to be WTF-8.
(* For POSIX systems one cannot even 100% rely on that... so maybe using an even wider set of constraints would be good.)

On platforms like OS X, the API-wrappers of Boost.Nowide would then simply produce the same result as the native APIs would -- because the narrow string would simply be passed through unmodified. If it's invalid UTF-8, and the OS decides to fail the call because of the invalid UTF-8, then so be it - using the library wouldn't change anything.

Almost the same for Windows: the library would simply round-trip invalid UTF-16 to the same invalid UTF-16. If the native Windows API decides to fail the call because of that, OK, then it just fails -- it would also have failed in a wchar_t application.
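A rough sketch of what such a round trip could look like, to make the idea concrete. The names to_wtf8, from_wtf8 and round_trips are invented for this example and are not Boost.Nowide functions, and the decoder assumes its input was produced by to_wtf8 (it does no validation). Proper surrogate pairs become ordinary 4-byte UTF-8; a lone surrogate is written as the 3-byte sequence its code point would get, which is what makes the encoding WTF-8 rather than UTF-8.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one code point (lone surrogates allowed) in generalized UTF-8 form.
static void append_code_point(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical: encode arbitrary 16-bit units (valid UTF-16 or not) as WTF-8.
std::string to_wtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Well-formed pair: combine into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        append_code_point(out, cp);  // lone surrogates fall through unchanged
    }
    return out;
}

// Hypothetical: decode WTF-8 produced by to_wtf8 (input is NOT validated).
std::u16string from_wtf8(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b0 < 0x80)      { cp = b0;        len = 1; }
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
        else                { cp = b0 & 0x07; len = 4; }
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {
            // Supplementary code point: split back into a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}

// True if the 16-bit string survives the trip through WTF-8 unchanged.
bool round_trips(const std::u16string& s) {
    return from_wtf8(to_wtf8(s)) == s;
}
```

For instance, the units 'a', 0xD800, 'b' (an unpaired high surrogate, so invalid UTF-16) encode to the bytes 61 ED A0 80 62 and decode back to the same three units. Whether the Windows API then accepts that string is a separate matter, exactly as it would be in a wchar_t application.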

But maybe I missed something here. If there really is a good reason for enforcing valid UTF-8 in some situation, please let me know :)

