Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2015-10-09 15:29:52


>> To be honest I don't know what guys who designed <codecvt> in
>> first place
> It was done in the early and mid 1990's, with primary input coming from
> Asian national bodies and the now long gone Unix vendors who had a big
> presence in that market.

I'm not talking about std::codecvt<> but about the new C++11 <codecvt> header,
which provides std::codecvt_utf8. That facet is actually useless for char16_t,
or for wchar_t on Windows, because there you need std::codecvt_utf8_utf16
instead. This is very unintuitive and is likely to cause a lot of trouble in
the future.
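
To make the difference concrete, here is a minimal sketch (not code from
Boost.Nowide or Boost.Locale). The helper names utf8_to_utf16, utf8_to_ucs2 and
demo are hypothetical. Note that <codecvt> and std::wstring_convert are
deprecated since C++17, although common standard libraries still ship them.

```cpp
#include <cassert>
#include <codecvt>    // deprecated since C++17, still widely available
#include <locale>
#include <stdexcept>
#include <string>

// UTF-8 -> UTF-16: code points above U+FFFF become surrogate pairs.
inline std::u16string utf8_to_utf16(const std::string& s)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(s);
}

// UTF-8 -> UCS-2: with char16_t only the Basic Multilingual Plane fits.
// Returns false if the facet reports a conversion error.
inline bool utf8_to_ucs2(const std::string& s, std::u16string& out)
{
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> conv;
    try {
        out = conv.from_bytes(s);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}

// Usage sketch: U+1F600 lies outside the BMP, so UTF-16 needs two code units.
inline void demo()
{
    const std::string emoji = "\xF0\x9F\x98\x80";
    assert(utf8_to_utf16(emoji).size() == 2);
}
```

For text inside the BMP both helpers give the same result. For U+1F600 only the
codecvt_utf8_utf16 version can produce a valid surrogate pair; what
utf8_to_ucs2 does with such input may vary between standard libraries.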

A major flaw of std::codecvt is that mbstate_t isn't well defined,
making it impossible to work with stateful encodings or
to do any composition/decomposition within the facet.

> Header <codecvt> isn't what we need, as you point out below.
>> Boost.Locale provides one, but currently it is a deeply internal and complex
>> part of the library.
>> The code I have written for Boost.Nowide, or the one I suggest putting into
>> the header-only part of Boost.Locale,
>> is a codecvt that converts between utf-8 and utf-16/32 according to the size
>> of the character:
>> boost::(nowide|or locale)::utf8_facet<wchar_t> - utf-8 to utf-16 (windows),
>> utf-32 (posix)
> Don't forget utf-8 to utf-8 (some embedded systems).

AFAIR, std::codecvt<char,char,mbstate_t> is required to be a non-converting
(noconv) facet.
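
This can be checked with a small sketch against the classic locale. The
function name char_facet_is_noconv is hypothetical; the facet calls themselves
are standard.

```cpp
#include <cassert>
#include <cwchar>   // std::mbstate_t
#include <locale>

// Returns true if the char-to-char facet of the classic locale reports that
// it never converts, both via always_noconv() and via the result of in().
inline bool char_facet_is_noconv()
{
    typedef std::codecvt<char, char, std::mbstate_t> facet_type;
    const facet_type& f = std::use_facet<facet_type>(std::locale::classic());

    const char src[] = "abc";
    char dst[4] = {};
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    char* to_next = nullptr;

    const std::codecvt_base::result r =
        f.in(state, src, src + 3, from_next, dst, dst + 3, to_next);

    return f.always_noconv() && r == std::codecvt_base::noconv;
}

// Usage sketch.
inline void check_noconv()
{
    assert(char_facet_is_noconv());
}
```

So a utf-8 to utf-8 facet that validates its input could not simply be this
specialization; it would have to be a separate facet type.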

Another requirement is to actually be able to iterate over the internal
characters one at a time, which is more difficult than for UTF-16.

> IMO, a critical aspect of all of those, including utf-8 to utf-8, is that
> they detect all utf-8 errors since ill-formed utf-8 is used as an attack
> vector.
> See Markus Kuhn's

It should. Actually, if you want to validate/encode/decode UTF (8/16/32),
there is boost::locale::utf::utf_traits that does it for you.
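
To show what kind of errors such validation has to catch, here is a standalone
sketch of a strict UTF-8 check. It is not the utf_traits implementation and
does not use Boost at all; the name is_valid_utf8 is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Strict UTF-8 check: rejects stray or missing continuation bytes, truncated
// sequences, overlong forms, surrogates and code points above U+10FFFF.
inline bool is_valid_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        unsigned long cp;      // code point decoded so far
        std::size_t trail;     // continuation bytes expected
        unsigned long min_cp;  // smallest code point allowed for this length

        if (lead < 0x80)                { cp = lead;        trail = 0; min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trail = 1; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trail = 2; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trail = 3; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i - 1 < trail) return false;  // truncated sequence
        for (std::size_t k = 1; k <= trail; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        i += trail + 1;
    }
    return true;
}

// Usage sketch: "\xC0\xAF" is an overlong encoding of '/', a classic attack input.
inline void check_overlong()
{
    assert(!is_valid_utf8("\xC0\xAF"));
}
```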

It is also a good test to look at for Boost.Locale.

