Subject: Re: [boost] [review] Review of Nowide (Unicode) starts today
From: Groke, Paul (paul.groke_at_[hidden])
Date: 2017-06-12 09:55:01

Yakov Galka wrote:

> On Mon, Jun 12, 2017 at 12:20 PM, Groke, Paul via Boost
> <mailto:boost_at_[hidden]> wrote:
>> I know modified UTF-8 is (can be) invalid UTF-8, that's why I asked. I
>> think it could make sense to support it anyway though. Round tripping
>> (strictly invalid, but possible) file names on Windows, easier
>> interoperability with stuff like JNI, ...
> Don't you mean WTF-8 then? AFAIK "Modified UTF-8" is UTF-8 that encodes
> the null character with an overlong sequence, and thus is incompatible
> with standard UTF-8, unlike WTF-8 which is a compatible extension.

No, I mean modified UTF-8. Modified UTF-8 is UTF-8 plus the following extensions:
- Allow encoding UTF-16 surrogates as if they were code points (=what "WTF-8" does)
- Allow an over-long 2 byte encoding of the NUL character
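
To make the two extensions concrete, here is a minimal sketch of an encoder that treats every UTF-16 code unit as if it were a code point and writes NUL as the over-long pair C0 80. The function name `encode_modified_utf8` is made up for illustration and is not part of Nowide or any Boost API. It makes no attempt to combine surrogate pairs into a 4-byte sequence, which is exactly the part that is not valid UTF-8.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not a Nowide API): encodes each UTF-16 code unit
// on its own, so surrogates come out as 3-byte sequences (ED A0..BF xx)
// and NUL comes out as the over-long 2-byte sequence C0 80.
std::string encode_modified_utf8(const std::u16string& in) {
    std::string out;
    for (char16_t cu : in) {
        const std::uint32_t c = cu;
        if (c == 0) {
            // Over-long NUL: keeps the output free of embedded zero bytes.
            out += '\xC0';
            out += '\x80';
        } else if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            // Covers the rest of the BMP, including D800..DFFF surrogates.
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

With this sketch the surrogate pair D83D DE00 becomes the six bytes ED A0 BD ED B8 80, where strict UTF-8 would use the four bytes F0 9F 98 80 for the same character.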

Neither is strictly UTF-8 compatible, but neither introduces significant
overhead in most situations. I don't see how over-long NUL encodings are
"more incompatible" than UTF-8 encoded surrogates, but then again
that's not really important.

>> OTOH it would add overhead for systems with native UTF-8 APIs, because
>> Nowide would at least have to check every string for "modified UTF-8
>> encoded" surrogate pairs and convert the string if necessary. Which of
>> course is a good argument for not supporting modified UTF-8, because
>> then Nowide could just pass the strings through unmodified on those
>> systems.
> Implementing WTF-8 removes a check in UTF-8 -> UTF-16 conversion, and
> doesn't change anything in the reverse direction when there is a valid
> UTF-16. I suspect it isn't slower.

Supporting modified UTF-8 or WTF-8 adds overhead on systems where the
native OS API accepts UTF-8, but only strictly valid UTF-8.
When some UTF-8 enabled function of the library is called on such a
system, it would have to check for WTF-8 encoded surrogates and
convert them to "true" UTF-8 before passing the string to the OS API.
Because you would expect and want the "normal" UTF-8 encoding for
a string to refer to the same file as the WTF-8 encoding of the same
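
The check described above could look roughly like the following sketch. The name `needs_conversion_to_strict_utf8` is hypothetical and not taken from Nowide. It assumes the input is otherwise well-formed, in which case the byte ED only ever appears as a lead byte and the byte C0 never appears at all, so looking at adjacent byte pairs is enough.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Nowide API): returns true if the string
// contains an encoded surrogate (ED A0..BF xx) or an over-long NUL
// (C0 80), i.e. if it would have to be rewritten before being handed
// to an OS API that accepts only strictly valid UTF-8.
// Assumes the input is otherwise well-formed.
bool needs_conversion_to_strict_utf8(const std::string& s) {
    for (std::size_t i = 0; i + 1 < s.size(); ++i) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF) {
            return true;  // encoded surrogate; strict UTF-8 allows only 80..9F after ED
        }
        if (b0 == 0xC0 && b1 == 0x80) {
            return true;  // over-long encoding of NUL
        }
    }
    return false;
}
```

This is a linear scan over every string, which is the overhead in question: cheap per call, but it has to run even when the string turns out to need no conversion.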

Paul Groke
