Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2015-10-08 10:54:40


----- Original Message -----

> From: Peter Dimov <lists_at_[hidden]>
> Artyom Beilis wrote:
>
>> We can create a "Separate" codecvt library with its own formal review and
>> it would be ready in best case in a year...
>
> One option is to put it into utility; another is to use a mini-review if the
> new codecvt library is an implementation of the standard <codecvt>
> interface.
>
> std::codecvt_utf8 is not quite the same as boost::utf8_codecvt_facet, but on
> the other hand, from your previous message it seems that your
> utf8_codecvt_facet is not std::codecvt_utf8 but std::codecvt_utf8_utf16, or
> perhaps it's the latter when wchar_t is 16 bit and the former when it's
> 32 bit.
>

[BEGIN: Long description regarding <codecvt> ]

To be honest, I don't know what the guys who designed <codecvt> in the first place
were thinking - I feel a strong influence of broken MS Unicode policies.

std::codecvt_utf8 is actually quite misleading - it converts between UTF-8 and UCS-2/UCS-4, i.e.
using it under Windows with wchar_t you wouldn't get UTF-16 support at all.
std::codecvt_utf8<wchar_t> basically does what boost::XXX::utf8_codecvt_facet does.

That makes it basically broken and useless, as UCS-2 is only a subset of a proper encoding (UTF-16).
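
To make the UCS-2 limitation concrete, here is a minimal sketch. It uses char16_t instead of wchar_t so that it behaves the same on every platform, and the helper name decode is only an illustration, not part of any library. It feeds the UTF-8 bytes of U+1F600 (a code point outside the BMP) to both facets:

```cpp
#include <codecvt>  // deprecated since C++17, still shipped by common standard libraries
#include <cwchar>   // std::mbstate_t
#include <locale>
#include <string>

// Hypothetical helper (not from any library): decodes a UTF-8 string into
// 16-bit code units with the given facet and returns the facet's result code.
template <class Facet>
std::codecvt_base::result decode(const std::string& utf8, std::u16string& out)
{
    Facet cvt;  // the <codecvt> facets have public destructors, so a local object is fine
    std::mbstate_t state = std::mbstate_t();

    // A UTF-8 sequence never decodes to more 16-bit units than it has bytes.
    out.assign(utf8.size(), u'\0');
    const char* from_next = utf8.data();
    char16_t* to_next = &out[0];

    std::codecvt_base::result r =
        cvt.in(state,
               utf8.data(), utf8.data() + utf8.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return r;
}
```

With the input "\xF0\x9F\x98\x80", decode<std::codecvt_utf8<char16_t>> reports std::codecvt_base::error because the code point does not fit into UCS-2, while decode<std::codecvt_utf8_utf16<char16_t>> succeeds and produces the surrogate pair 0xD83D 0xDE00.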

Now <codecvt>'s std::codecvt_mode is a clear Microsoftism: for example, using a UTF-8
BOM is one of the many Unicode crimes Microsoft created, as is storing UTF-16 files
on disk.

Another hilarious thing is the Maxcode = 0x10ffff template parameter for the facet...

It is like creating

template<double Pi_Value=3.14159>
class circle;

0x10FFFF IS the max value for a Unicode code point, not 0xFFFF, not anything else.

std::codecvt_utf16 is an attempt to build a "narrow" UTF-16 encoding, just no comment...

std::codecvt_utf8_utf16 is actually useful under Windows and does what it is supposed to
do with wchar_t... but on POSIX platforms it is impossible to use std::codecvt_utf8_utf16
with wchar_t because wchar_t is UTF-32...

So if you want to install a UTF-8 to wchar_t codecvt facet that represents UTF-16 or UTF-32 according
to the platform, you need to use

if(sizeof(wchar_t) == 2)
   return new std::codecvt_utf8_utf16<wchar_t>();
else // sizeof(wchar_t) == 4
   return new std::codecvt_utf8<wchar_t>();

So all of <codecvt> was built wrong, under the strong influence of Microsoft development policies,
and is useless for any cross-platform development.
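
For reference, the platform-specific trick above can be fleshed out into something compilable. This is only a sketch: the function names make_utf8_wchar_facet and utf8_to_wide are made up for illustration, and it assumes wchar_t is either 2 or 4 bytes wide.

```cpp
#include <codecvt>    // deprecated since C++17, still shipped by common standard libraries
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> wide_codecvt;

// Hypothetical helper: picks the standard facet that matches the width of wchar_t.
inline wide_codecvt* make_utf8_wchar_facet()
{
    static_assert(sizeof(wchar_t) == 2 || sizeof(wchar_t) == 4,
                  "this sketch assumes a 16-bit or 32-bit wchar_t");
    if (sizeof(wchar_t) == 2)
        return new std::codecvt_utf8_utf16<wchar_t>();  // UTF-8 <-> UTF-16
    else
        return new std::codecvt_utf8<wchar_t>();        // UTF-8 <-> UCS-4/UTF-32
}

// Hypothetical helper: decodes UTF-8 into the native wide encoding through a
// locale that carries the facet chosen above.
inline std::wstring utf8_to_wide(const std::string& in)
{
    if (in.empty())
        return std::wstring();

    // The locale takes ownership of the facet and replaces its wchar_t codecvt.
    std::locale loc(std::locale::classic(), make_utf8_wchar_facet());
    const wide_codecvt& cvt = std::use_facet<wide_codecvt>(loc);

    std::mbstate_t state = std::mbstate_t();
    std::wstring out(in.size(), L'\0');  // never more wide units than input bytes
    const char* from_next = in.data();
    wchar_t* to_next = &out[0];

    std::codecvt_base::result r =
        cvt.in(state,
               in.data(), in.data() + in.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("invalid or truncated UTF-8 input");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

The same locale object could be imbued into a std::wfstream before opening a file. For U+1F600 the helper yields two wchar_t units where wchar_t is 16 bits wide and one unit where it is 32 bits wide.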

So... Boost community - please do yourself a favor: don't use <codecvt> unless you really
understand what you are doing.

[END: Long description regarding <codecvt> ]

If you want to convert UTF-8 files properly to the native wide character encoding, for example for
boost::filesystem, boost::serialization or std::fstream, you need to use a facet that converts to
UTF-16 or UTF-32 according to what wchar_t holds, and <codecvt> does not provide one (without
platform-specific tricks).

So I'm not going to implement C++11 <codecvt>, because IMHO it is broken by design in the first
place.

Boost.Locale provides one, but currently it is a deeply internal and complex part of the library.

The code I have written for Boost.Nowide, or the one I suggest putting into the header-only part of
Boost.Locale, is a codecvt that converts between UTF-8 and UTF-16/32 according to the size of the character:

boost::(nowide|or locale)::utf8_facet<wchar_t> - utf-8 to utf-16 (windows) utf-32 (posix)
boost::(nowide|or locale)::utf8_facet<char16_t> - utf-8 to utf-16 on any platform
boost::(nowide|or locale)::utf8_facet<char32_t> - utf-8 to utf-32 on any platform

That's it. It isn't <codecvt> because C++11 <codecvt> does not actually do the job needed.

Artyom Beilis

