Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Beman Dawes (bdawes_at_[hidden])
Date: 2015-10-09 10:17:51


> To be honest I don't know what guys who designed <codecvt> in first place
>

It was designed in the early and mid-1990s, with primary input coming from
Asian national bodies and the now long-gone Unix vendors who had a big
presence in that market.

> thought of - I feel strong influence of broken MS Unicode policies
>

This was years before Microsoft folks started to participate in the LWG.

> So I'm not going to implement C++11 <codecvt> because IMHO it is broken by
> design in the first place.
>

Header <codecvt> isn't what we need, as you point out below.

>
> Boost.Locale provides one, but currently it is a deeply internal and complex
> part of the library.
>
> The code I have written for Boost.Nowide, or the one I suggest putting into
> the header-only part of Boost.Locale, is a codecvt that converts between
> utf-8 and utf-16/32 according to the size of the character:
>
> boost::(nowide|or locale)::utf8_facet<wchar_t> - utf-8 to utf-16 (windows)
> or utf-32 (posix)
>

Don't forget utf-8 to utf-8 (some embedded systems).
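
To make the "according to size of character" idea concrete, here is a minimal sketch of the encoding half, covering the one-byte case too. The name append_code_point is hypothetical and is not part of Boost.Nowide or Boost.Locale; it only illustrates selecting UTF-8, UTF-16 or UTF-32 from sizeof(CharT).

```cpp
#include <string>

// Hypothetical helper, not an existing Boost API.
// Appends one Unicode code point to `out`, choosing the encoding from the
// code unit size: 1 byte -> UTF-8, 2 bytes -> UTF-16, 4 bytes -> UTF-32.
// Returns false (and appends nothing) for surrogates and values > U+10FFFF.
template <typename CharT>
bool append_code_point(std::basic_string<CharT>& out, char32_t cp)
{
    static_assert(sizeof(CharT) == 1 || sizeof(CharT) == 2 || sizeof(CharT) == 4,
                  "unsupported code unit size");

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

    if (sizeof(CharT) == 1) {            // UTF-8 (char, e.g. embedded targets)
        if (cp < 0x80) {
            out += static_cast<CharT>(cp);
        } else if (cp < 0x800) {
            out += static_cast<CharT>(0xC0 | (cp >> 6));
            out += static_cast<CharT>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<CharT>(0xE0 | (cp >> 12));
            out += static_cast<CharT>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<CharT>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<CharT>(0xF0 | (cp >> 18));
            out += static_cast<CharT>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<CharT>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<CharT>(0x80 | (cp & 0x3F));
        }
    } else if (sizeof(CharT) == 2) {     // UTF-16 (wchar_t on Windows)
        if (cp < 0x10000) {
            out += static_cast<CharT>(cp);
        } else {
            const char32_t v = cp - 0x10000;   // 20 bits split into a pair
            out += static_cast<CharT>(0xD800 | (v >> 10));
            out += static_cast<CharT>(0xDC00 | (v & 0x3FF));
        }
    } else {                             // UTF-32 (wchar_t on most POSIX)
        out += static_cast<CharT>(cp);
    }
    return true;
}
```

With std::wstring this yields a surrogate pair where wchar_t is 16 bits wide and a single unit where it is 32 bits wide; with std::string it yields UTF-8.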

IMO, a critical aspect of all of those, including utf-8 to utf-8, is that
they detect all utf-8 errors since ill-formed utf-8 is used as an attack
vector.
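
As a rough illustration of what "detect all utf-8 errors" involves, the sketch below checks a byte string against the well-formed byte ranges given in the Unicode Standard (overlong forms, surrogates, values above U+10FFFF, stray continuation bytes and truncated sequences are all rejected). The function name is_well_formed_utf8 is hypothetical; this is not the Boost facet's code, only one way such a check could look.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not an existing Boost API.
// Returns true only if `s` consists entirely of well-formed UTF-8 sequences.
bool is_well_formed_utf8(const std::string& s)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;

    while (i < n) {
        const unsigned char b0 = p[i];
        if (b0 <= 0x7F) {                 // ASCII
            ++i;
            continue;
        }

        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;   // allowed range of the 2nd byte

        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; }
        else if (b0 == 0xE0)               { len = 3; lo = 0xA0; } // no overlong
        else if (b0 == 0xED)               { len = 3; hi = 0x9F; } // no surrogates
        else if (b0 >= 0xE1 && b0 <= 0xEF) { len = 3; }
        else if (b0 == 0xF0)               { len = 4; lo = 0x90; } // no overlong
        else if (b0 == 0xF4)               { len = 4; hi = 0x8F; } // <= U+10FFFF
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else return false;   // 0x80..0xC1 and 0xF5..0xFF never start a sequence

        if (n - i < len)
            return false;                 // truncated sequence at end of input
        if (p[i + 1] < lo || p[i + 1] > hi)
            return false;                 // bad second byte
        for (std::size_t k = 2; k < len; ++k)
            if ((p[i + k] & 0xC0) != 0x80)
                return false;             // bad continuation byte
        i += len;
    }
    return true;
}
```

For example, the overlong encoding "\xC0\xAF" of '/' and the encoded surrogate "\xED\xA0\x80" are both rejected, while "\xF4\x8F\xBF\xBF" (U+10FFFF) is accepted.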

See Markus Kuhn's
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

I can contribute a Boost regression test friendly version of Kuhn's
malformed tests.

--Beman

