Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2015-10-08 10:54:40
----- Original Message -----
> From: Peter Dimov <lists_at_[hidden]>
> Artyom Beilis wrote:
>
>> We can create a "Separate" codecvt library with its own formal review and
>> it would be ready in best case in a year...
>
> One option is to put it into utility; another is to use a mini-review if the
> new codecvt library is an implementation of the standard <codecvt>
> interface.
>
> std::codecvt_utf8 is not quite the same as boost::utf8_codecvt_facet, but on
> the other hand, from your previous message it seems that your
> utf8_codecvt_facet is not std::codecvt_utf8 but std::codecvt_utf8_utf16, or
> perhaps it's the latter when wchar_t is 16 bit and the former when it's
> 32 bit.
>
[BEGIN: Long description regarding <codecvt> ]
To be honest, I don't know what the people who designed <codecvt> in the first place
were thinking. I sense a strong influence of broken MS Unicode policies.
std::codecvt_utf8 is actually quite misleading: it converts between UTF-8 and UCS-2/UCS-4, i.e.
if you use it under Windows with wchar_t you don't get UTF-16 support at all.
std::codecvt_utf8<wchar_t> basically does what boost::XXX::utf8_codecvt_facet does.
That is basically broken and useless, because UCS-2 is only a subset of the proper encoding (UTF-16).
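To make the UCS-2 limitation concrete, here is a minimal sketch that feeds the same UTF-8 input to std::codecvt_utf8<char16_t> and to std::codecvt_utf8_utf16<char16_t> through std::wstring_convert. The helper name count_units16 is made up for this illustration. The sketch assumes a standard library that still ships <codecvt> (it was deprecated in C++17), and some toolchains have had trouble with the char16_t specializations.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert UTF-8 to 16-bit units with the given facet
// and return how many units were produced, or -1 if the facet rejected
// the input (std::wstring_convert reports that as std::range_error).
template<typename Facet>
long count_units16(const std::string& utf8)
{
    std::wstring_convert<Facet, char16_t> conv;
    try {
        return static_cast<long>(conv.from_bytes(utf8).size());
    } catch (const std::range_error&) {
        return -1;
    }
}
```

With "\xF0\x9F\x98\x80" (U+1F600, outside the BMP) as input, count_units16<std::codecvt_utf8_utf16<char16_t> > should report a surrogate pair, while count_units16<std::codecvt_utf8<char16_t> > is expected to fail, because UCS-2 has no way to represent that code point.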
Now <codecvt>'s std::codecvt_mode is a clear Microsoftism: for example, the UTF-8
BOM is one of the many Unicode crimes Microsoft created, as is storing UTF-16 files
on disk.
Another hilarious thing is the Maxcode = 0x10ffff template parameter of the facets...
It is like creating
template<double Pi_Value=3.14159>
class circle;
0x10FFFF IS the maximum value of a Unicode code point, not 0xFFFF and not anything else.
std::codecvt_utf16 is an attempt to build a "narrow" UTF-16 encoding, just no comment...
std::codecvt_utf8_utf16 is actually useful under Windows and does what it is supposed
to do with wchar_t... but on POSIX platforms it is impossible to use std::codecvt_utf8_utf16
with wchar_t because wchar_t is UTF-32 there...
So if you want to install a UTF-8 to wchar_t codecvt facet that represents UTF-16 or UTF-32 according
to the platform, you need to use:
if(sizeof(wchar_t) == 2)
return new std::codecvt_utf8_utf16<wchar_t>();
else // sizeof(wchar_t) == 4
return new std::codecvt_utf8<wchar_t>();
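As a function body, the selection above needs a common return type, since the two branches create different classes. A minimal sketch of that platform trick follows. The names make_utf8_wchar_facet, utf8_locale and widen_utf8 are invented for this illustration and are not part of Boost or the standard library.

```cpp
#include <codecvt>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> wchar_facet;

// Hypothetical helper: pick the <codecvt> facet that matches what wchar_t
// holds on this platform (UTF-16 if 16 bits wide, UTF-32 otherwise).
inline wchar_facet* make_utf8_wchar_facet()
{
    if (sizeof(wchar_t) == 2)
        return new std::codecvt_utf8_utf16<wchar_t>();
    else // sizeof(wchar_t) == 4
        return new std::codecvt_utf8<wchar_t>();
}

// The locale takes ownership of the facet and replaces its codecvt slot.
inline std::locale utf8_locale(const std::locale& base = std::locale::classic())
{
    return std::locale(base, make_utf8_wchar_facet());
}

// Convert a complete UTF-8 string through the installed facet.
inline std::wstring widen_utf8(const std::string& in)
{
    if (in.empty())
        return std::wstring();
    std::locale loc = utf8_locale();
    const wchar_facet& cvt = std::use_facet<wchar_facet>(loc);
    std::mbstate_t state = std::mbstate_t();
    // UTF-8 never yields more UTF-16/UTF-32 units than it has bytes.
    std::wstring out(in.size(), L'\0');
    const char* from_next = 0;
    wchar_t* to_next = 0;
    std::codecvt_base::result r = cvt.in(state,
        in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("invalid or truncated UTF-8");
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

The same locale can be imbued into a std::wifstream before the file is opened. The branch on sizeof(wchar_t) is exactly the platform-specific part that a single facet would hide.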
So all of <codecvt> was built wrong under the strong influence of Microsoft development policies, and it is useless
for any cross-platform development.
So... Boost community, please do yourselves a favor: don't use <codecvt> unless you really
understand what you are doing.
[END: Long description regarding <codecvt> ]
If you want to convert UTF-8 files properly to the native wide character type, for example for boost::filesystem,
boost::serialization or std::fstream, you need a facet that converts to UTF-16 or UTF-32
according to what wchar_t holds, and <codecvt> does not provide one (without platform-specific tricks).
So I'm not going to implement C++11 <codecvt>, because IMHO it is broken by design in the first
place.
Boost.Locale provides one, but currently it is a deep internal and complex part of the library.
The code I have written for Boost.Nowide, or the one I suggest putting into the header-only part of Boost.Locale,
is a codecvt that converts between UTF-8 and UTF-16/32 according to the size of the character type:
boost::(nowide|or locale)::utf8_facet<wchar_t> - utf-8 to utf-16 (windows) utf-32 (posix)
boost::(nowide|or locale)::utf8_facet<char16_t> - utf-8 to utf-16 on any platform
boost::(nowide|or locale)::utf8_facet<char32_t> - utf-8 to utf-32 on any platform
That's it. It isn't <codecvt> because C++11 <codecvt> does not actually do the job needed.
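To illustrate the "according to size of character" idea without <codecvt>, here is a rough sketch of just the conversion step as a free function. It is not the Boost.Nowide or Boost.Locale code and not a codecvt facet; decode_utf8 and utf8_to_wide are names invented for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode one code point starting at s[i] and advance i past it.
// Rejects truncated input, overlong forms, surrogates and values > 0x10FFFF.
inline char32_t decode_utf8(const std::string& s, std::size_t& i)
{
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80)
        return lead;
    int trail;
    char32_t cp;
    if (lead >= 0xC2 && lead <= 0xDF)      { trail = 1; cp = lead & 0x1F; }
    else if (lead >= 0xE0 && lead <= 0xEF) { trail = 2; cp = lead & 0x0F; }
    else if (lead >= 0xF0 && lead <= 0xF4) { trail = 3; cp = lead & 0x07; }
    else throw std::runtime_error("invalid UTF-8 lead byte");
    for (int k = 0; k < trail; ++k) {
        if (i >= s.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        unsigned char c = static_cast<unsigned char>(s[i++]);
        if ((c & 0xC0) != 0x80)
            throw std::runtime_error("invalid UTF-8 trail byte");
        cp = (cp << 6) | (c & 0x3F);
    }
    static const char32_t min_for_len[] = { 0, 0x80, 0x800, 0x10000 };
    if (cp < min_for_len[trail] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::runtime_error("invalid code point");
    return cp;
}

// UTF-8 to UTF-16 when CharT is 16 bits wide, to UTF-32 when it is 32 bits.
template<typename CharT>
std::basic_string<CharT> utf8_to_wide(const std::string& in)
{
    static_assert(sizeof(CharT) == 2 || sizeof(CharT) == 4,
                  "CharT must be 16 or 32 bits wide");
    std::basic_string<CharT> out;
    for (std::size_t i = 0; i < in.size(); ) {
        char32_t cp = decode_utf8(in, i);
        if (sizeof(CharT) == 4 || cp < 0x10000) {
            out.push_back(static_cast<CharT>(cp));
        } else {
            // Outside the BMP with 16-bit units: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<CharT>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<CharT>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

utf8_to_wide<wchar_t> then yields UTF-16 where wchar_t is 16 bits wide and UTF-32 where it is 32 bits wide, with no per-platform branch in the calling code. A real facet additionally has to handle partial input and carry state between calls, which this sketch ignores.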
Artyom Beilis