Subject: [boost] [review] Review of Nowide (Unicode) continues - summary of the discussions so far
From: Frédéric Bron (frederic.bron_at_[hidden])
Date: 2017-06-15 07:20:10


Dear all,

The formal review of Artyom Beilis' Nowide library continues until
Wed. 21st of June.

Here is a summary of the discussions so far:

7 people have participated in the review so far (not counting Artyom and myself):
- Degski
- Niall Douglas
- Paul Groke
- Peter Dimov
- Vadim Zeitlin
- Yakov Galka
- Zach Laine

2 official reviews have been shared, both in favor of inclusion in
Boost (Degski, Peter Dimov).

Discussions:

Everybody is positive about the usefulness of the library, but some major
questions arose:

1. how should we handle invalid Unicode sequences (bytes or multi-byte
  sequences)? Allow them? Throw an error?
2. Windows and Posix are treated asymmetrically:
- on Windows, the conversion from UTF-16 checks for Unicode conformance and
  either returns valid UTF-8 or fails
- on Posix, no check is performed that the input is a valid UTF-8 sequence.
3. in particular, on Windows, the round-trip conversion UTF-16 -> UTF-8 ->
  UTF-16 is not guaranteed if the initial string is non-conformant
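
To make point 3 concrete, here is a minimal sketch of a strict UTF-16 -> UTF-8
conversion that fails on an unpaired surrogate. It is not Nowide's code: the
names append_utf8 and utf16_to_utf8_strict are hypothetical and only illustrate
the "valid UTF-8 or fail" behaviour described for Windows.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Nowide): append one code point as UTF-8.
inline void append_utf8(std::string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF); // callers only pass values in the Unicode range
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical strict conversion: returns false as soon as an unpaired
// surrogate is found, so such a string has no UTF-8 form at all.
inline bool utf16_to_utf8_strict(const std::u16string& in, std::string& out)
{
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF) {            // lead surrogate
            if (i + 1 >= in.size()) return false;    // nothing follows it
            char32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF) return false; // not a trail
            u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
            ++i;
        } else if (u >= 0xDC00 && u <= 0xDFFF) {     // trail without lead
            return false;
        }
        append_utf8(out, u);
    }
    return true;
}
```

With such a conversion, a file name made of 'a', the lone code unit 0xD800
and 'b' is rejected, so it can never be converted back to the original
UTF-16 name.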

One main issue is that existing files may have non-conformant names and would
therefore not be reachable through the nowide API on Windows. On Posix platforms,
non-conformant paths would work transparently.

Different proposals have been made to address this:
- convert from UTF-16 to a superset of UTF-8 so that the round-trip conversion is
  possible; this would mean that the library accepts non-conformant strings on
  Windows, as is done on Posix. Question: which encoding should be used
  (Modified UTF-8, WTF-8, CESU-8)? WTF-8 has the issue of being difficult to
  handle with string concatenations.
- use the RtlUTF8ToUnicodeN functions, which replace wrong UTF-16 characters by
  U+FFFD, but is this OK for round-trip?
- see the glib approach
- add a function to explicitly convert from wide to WTF-8
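
As an illustration of the first proposal and of the concatenation issue, here
is a rough sketch of a WTF-8 style encoder: paired surrogates become one
4-byte sequence, while an unpaired surrogate is kept as a 3-byte sequence
instead of being rejected. The function names are hypothetical and this is
not code from Nowide or from any of the proposals.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: generic UTF-8 bit layout. Surrogate values are
// deliberately NOT rejected here, which is what makes the output a
// superset of UTF-8.
inline void append_generalized_utf8(std::string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical WTF-8 style encoder: never fails on unpaired surrogates.
inline std::string utf16_to_wtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        bool is_lead = (u >= 0xD800 && u <= 0xDBFF);
        if (is_lead && i + 1 < in.size()) {
            char32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Proper pair: combine into one supplementary code point.
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        // Unpaired surrogates fall through and get a 3-byte sequence.
        append_generalized_utf8(out, u);
    }
    return out;
}
```

The concatenation problem shows up with the pair 0xD83D 0xDE00 (U+1F600):
encoding the two halves separately gives ED A0 BD and ED B8 80, while
encoding the joined string gives F0 9F 98 80. Plain byte concatenation of the
two encoded halves therefore differs from the encoding of the concatenated
string, so concatenation needs a special case at the junction.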

However, for the Posix case, the issue is that we cannot guarantee that the
encoding is always UTF-8, so checking for conformance may be impossible. Hence
the choice not to check on Posix.
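
For reference, a conformance check on Posix could look roughly like the
sketch below (is_valid_utf8 is a hypothetical name, not a Nowide function).
It rejects overlong forms, encoded surrogates and values above U+10FFFF. It
also shows why a check is problematic: a perfectly legitimate Latin-1 file
name fails it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical validator: true if s is a well-formed UTF-8 byte sequence.
inline bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b0 < 0x80)                     { len = 1; cp = b0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
        else return false; // stray continuation byte, C0/C1, or F5..FF

        if (n - i < len) return false;     // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len == 3 && cp < 0x800) return false;             // overlong
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;       // surrogate
        i += len;
    }
    assert(i == n); // every byte belongs to exactly one accepted sequence
    return true;
}
```

Here "caf\xC3\xA9" (UTF-8) passes, whereas "caf\xE9" (the same name in
Latin-1) is rejected, although both can be valid file names on a Posix
system.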

Minor points:
- Missing some documentation on what happens if invalid UTF-8 is provided
  (getenv, setenv, cout, cerr).
- ::setenv on Cygwin gives a compile-time error (Peter Dimov proposed a fix)
- suggestion to add stat and readdir

Frédéric

