Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2015-10-09 15:29:52
>
>> To be honest I don't know what guys who designed <codecvt> in
>> first place
>>
>
> It was done in the early and mid 1990's, with primary input coming from
> Asian national bodies and the now long gone Unix vendors who had a big
> presence in that market.
>
I'm not talking about std::codecvt<> but about the new C++11 <codecvt> header,
which provides std::codecvt_utf8. That facet is actually useless for char16_t,
or for wchar_t on Windows, because there you need std::codecvt_utf8_utf16
instead. This is very unintuitive and is likely to cause a lot of trouble in
the future.

A major flaw of std::codecvt is mbstate_t, which isn't well defined, making it
impossible to work with stateful encodings or to do composition/decomposition
within the facet.
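
To illustrate the naming problem, here is a minimal sketch. The helper names
to_utf16 and to_ucs2 are made up for this example, and it relies on
std::wstring_convert, which (like the rest of <codecvt>) was later deprecated
in C++17. std::codecvt_utf8<char16_t> only handles UCS-2, so a code point
above U+FFFF is expected to be rejected rather than turned into a surrogate
pair, while std::codecvt_utf8_utf16<char16_t> produces the pair.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: UTF-8 -> UTF-16 through codecvt_utf8_utf16.
// Throws std::range_error on input the facet cannot convert.
std::u16string to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: the same input through codecvt_utf8<char16_t>,
// which targets UCS-2 (no surrogate pairs).
// Returns false if the facet rejects the input.
bool to_ucs2(const std::string& utf8, std::u16string& out) {
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> conv;
    try {
        out = conv.from_bytes(utf8);
        return true;
    } catch (const std::range_error&) {
        return false;  // conversion failed, e.g. code point outside the BMP
    }
}
```

For the 4-byte sequence F0 9F 98 80 (U+1F600), to_utf16 should give the
surrogate pair D83D DE00. I would expect to_ucs2 to return false for that
input on the common standard library implementations, although I have not
checked all of them.
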
>
>
> Header <codecvt> isn't what we need, as you point out below.
>
>
>>
>> Boost.Locale provides one but currently it is deep internal and complex
>> part of library.
>>
>> The code I have written for Boost.Nowide, or the one I suggest putting into
>> the header-only part of Boost.Locale,
>> is a codecvt that converts between utf-8 and utf-16/32 according to the size
>> of the character:
>>
>> boost::(nowide|or locale)::utf8_facet<wchar_t> - utf-8 to utf-16 (windows),
>> utf-32 (posix)
>>
>
> Don't forget utf-8 to utf-8 (some embedded systems).
>
AFAIR std::codecvt<char,char,mbstate_t> is required to be noconv.
Another requirement is to actually be able to iterate over the internal
characters one at a time, which is more difficult than for UTF-16.
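
A small sketch of what "noconv" means here, using the facet from the classic
locale. The function name char_codecvt_is_noconv is made up for this example;
the facet calls themselves (always_noconv and in) are standard.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>

// Returns true if the char-to-char facet of the classic locale reports
// that it performs no conversion at all.
bool char_codecvt_is_noconv() {
    typedef std::codecvt<char, char, std::mbstate_t> facet_type;
    const facet_type& f = std::use_facet<facet_type>(std::locale::classic());

    const char in[] = "\xC3\xA9";  // two bytes of UTF-8, contents irrelevant here
    char out[8] = {};
    std::mbstate_t state = std::mbstate_t();
    const char* in_next = nullptr;
    char* out_next = nullptr;

    // For this specialization in() is expected to return noconv
    // and leave the output buffer untouched.
    std::codecvt_base::result r =
        f.in(state, in, in + 2, in_next, out, out + sizeof(out), out_next);

    return f.always_noconv() && r == std::codecvt_base::noconv;
}
```

So a utf-8 to utf-8 facet that validates its input could not simply be a
std::codecvt<char,char,mbstate_t>, since that one never looks at the bytes.
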
> IMO, a critical aspect of all of those, including utf-8 to utf-8, is that
> they detect all utf-8 errors since ill-formed utf-8 is used as an attack
> vector.
>
> See Markus Kuhn's
> https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>
It should. Actually, if you want to validate/encode/decode UTF (8/16/32),
there is boost::locale::utf::utf_traits, which does it for you.
It is also a good test to look at for Boost.Locale.
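
To make "detect all utf-8 errors" concrete, here is a rough standalone sketch
of a strict validator. It is not the Boost.Locale implementation, and the name
is_valid_utf8 is made up for this example. It rejects the classes of input
that the test file above exercises: stray continuation bytes, truncated
sequences, overlong forms, surrogates, and values above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical strict UTF-8 validator (illustration only).
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        if (lead < 0x80) { ++i; continue; }  // plain ASCII

        std::size_t len = 0;     // total length of the sequence
        std::uint32_t cp = 0;    // decoded code point
        std::uint32_t min = 0;   // smallest value allowed for this length
        if (lead >= 0xC2 && lead <= 0xDF)      { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if (lead >= 0xE0 && lead <= 0xEF) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if (lead >= 0xF0 && lead <= 0xF4) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte, C0/C1, or F5..FF

        if (n - i < len) return false;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                     // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        i += len;
    }
    return true;
}
```

For example, the overlong encoding of '/' (C0 AF) and the encoded surrogate
ED A0 80 are both rejected, while C3 A9 (U+00E9) is accepted.
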
Artyom
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk