Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2015-10-08 10:54:40

----- Original Message -----

> From: Peter Dimov <lists_at_[hidden]>
> Artyom Beilis wrote:
>
>> We can create a "Separate" codecvt library with its own formal review and
>> it would be ready in best case in a year...
>
> One option is to put it into utility; another is to use a mini-review if the
> new codecvt library is an implementation of the standard <codecvt>
> interface.
>
> std::codecvt_utf8 is not quite the same as boost::utf8_codecvt_facet, but on
> the other hand, from your previous message it seems that your
> utf8_codecvt_facet is not std::codecvt_utf8 but std::codecvt_utf8_utf16, or
> perhaps it's the latter when wchar_t is 16 bit and the former when it's
> 32 bit.
>

[BEGIN: Long description regarding <codecvt> ]

To be honest, I don't know what the people who designed <codecvt> in the first place
were thinking of. I sense a strong influence of broken MS Unicode policies.

std::codecvt_utf8 is actually quite misleading: it converts between UTF-8 and UCS-2/UCS-4, i.e.
using it under Windows with wchar_t you wouldn't get UTF-16 support at all. std::codecvt_utf8<wchar_t>
basically does what boost::XXX::utf8_codecvt_facet does.

Basically broken and useless, as UCS-2 covers only a subset of a proper encoding (UTF-16).
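
To make the UCS-2 point concrete, here is a minimal sketch (not part of Boost or of <codecvt>; the function names encode_utf16 and fits_in_ucs2 are made up for illustration) of how a code point maps to UTF-16 code units. Anything above 0xFFFF needs a surrogate pair, which is exactly what a UCS-2 converter cannot produce:

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Returns an empty string for invalid input (surrogate values or > 0x10FFFF).
std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return out;                                   // not a valid code point
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));     // BMP: one unit, same as UCS-2
    } else {
        cp -= 0x10000;                                // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}

// UCS-2 has no surrogate pairs, so only code points up to 0xFFFF fit.
bool fits_in_ucs2(char32_t cp)
{
    return cp <= 0xFFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

For example, U+20AC (the euro sign) fits in a single unit either way, while U+1F600 becomes the pair 0xD83D 0xDE00 in UTF-16 and has no UCS-2 representation at all.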

Now, <codecvt>'s std::codecvt_mode is a clear Microsoftism: for example, the UTF-8
BOM is one of the many Unicode crimes Microsoft created, as is storing UTF-16 files
on disk.

Another hilarious thing is the Maxcode = 0x10ffff template parameter of the facet...

It is like creating

template<double Pi_Value=3.14159>
class circle;

0x10FFFF IS the maximum value of a Unicode code point, not 0xFFFF, not anything else.

std::codecvt_utf16 is an attempt to build a "narrow" UTF-16 encoding, just no comment...

std::codecvt_utf8_utf16 is actually useful under Windows and does what it is supposed to
do with wchar_t... but on POSIX platforms it is impossible to use std::codecvt_utf8_utf16
with wchar_t because wchar_t is UTF-32...

So if you want to install a UTF-8 to wchar_t codecvt facet that represents UTF-16 or UTF-32 according
to the platform, you need to use

if(sizeof(wchar_t) == 2)
    return new std::codecvt_utf8_utf16<wchar_t>();
else // sizeof(wchar_t) == 4
    return new std::codecvt_utf8<wchar_t>();

So all of <codecvt> was built wrong under the strong influence of Microsoft development policies and is useless
for any cross-platform development.
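
The platform-specific trick above can also be done at compile time. The following is only a sketch under the assumption that a 2-byte wchar_t means UTF-16 and a 4-byte wchar_t means UTF-32; the names native_utf8_facet, make_utf8_locale and utf8_to_wide are hypothetical helpers, not anything provided by the standard library or Boost. Note also that <codecvt> was deprecated in C++17, so a compiler may warn about it:

```cpp
#include <codecvt>
#include <locale>
#include <string>
#include <type_traits>

// Pick the <codecvt> facet that matches what wchar_t holds on this platform:
// UTF-16 when wchar_t is 2 bytes (Windows), UTF-32 when it is 4 bytes (POSIX).
using native_utf8_facet = std::conditional<
    sizeof(wchar_t) == 2,
    std::codecvt_utf8_utf16<wchar_t>,
    std::codecvt_utf8<wchar_t>
>::type;

// Hypothetical helper: build a locale that uses the selected facet,
// e.g. for imbuing a std::wfstream.
std::locale make_utf8_locale(const std::locale& base = std::locale::classic())
{
    return std::locale(base, new native_utf8_facet());
}

// Hypothetical helper: one-shot conversion, handy for checking the behaviour.
// from_bytes throws std::range_error on malformed input.
std::wstring utf8_to_wide(const std::string& utf8)
{
    std::wstring_convert<native_utf8_facet, wchar_t> conv;
    return conv.from_bytes(utf8);
}
```

With this, a 4-byte UTF-8 sequence such as "\xF0\x9F\x98\x80" should come out as two wchar_t units where wchar_t is 16 bit and as one unit where it is 32 bit.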

So... Boost community, please do yourselves a favor: don't use <codecvt> unless you really
understand what you are doing.

[END: Long description regarding <codecvt> ]

If you want to convert UTF-8 files properly to the native wide character type, for example for
boost::filesystem, boost::serialization or std::fstream, you need to use a facet that converts to
UTF-16 or UTF-32 according to what wchar_t holds, and <codecvt> does not provide one (without
platform-specific tricks).

So I'm not going to implement C++11 <codecvt> because IMHO it is broken by design in the first
place.

Boost.Locale provides one, but currently it is a deep internal and complex part of the library.

The code I have written for Boost.Nowide, or the one I suggest putting into the header-only part of
Boost.Locale, is a codecvt that converts between UTF-8 and UTF-16/32 according to the size of the character:

boost::(nowide|or locale)::utf8_facet<wchar_t>  - UTF-8 to UTF-16 (Windows) or UTF-32 (POSIX)
boost::(nowide|or locale)::utf8_facet<char16_t> - UTF-8 to UTF-16 on any platform
boost::(nowide|or locale)::utf8_facet<char32_t> - UTF-8 to UTF-32 on any platform
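
To illustrate the idea of choosing the target encoding from the size of the character type, here is a rough sketch. It is not the actual utf8_facet code and not a codecvt facet at all; decode_utf8 and utf8_to are hypothetical names, and the whole thing is just a plain string conversion that dispatches on sizeof(CharT):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode one code point from UTF-8 starting at s[i]; advances i.
// Throws std::runtime_error on malformed input.
inline char32_t decode_utf8(const std::string& s, std::size_t& i)
{
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;                      // ASCII
    int extra;
    char32_t cp;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else throw std::runtime_error("invalid UTF-8 lead byte");
    for (int k = 0; k < extra; ++k) {
        if (i >= s.size()) throw std::runtime_error("truncated UTF-8 sequence");
        unsigned char c = static_cast<unsigned char>(s[i++]);
        if ((c & 0xC0) != 0x80) throw std::runtime_error("invalid continuation byte");
        cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong forms, surrogates and values above 0x10FFFF.
    static const char32_t min_for[] = {0, 0x80, 0x800, 0x10000};
    if (cp < min_for[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::runtime_error("invalid code point");
    return cp;
}

// Convert UTF-8 to UTF-16 when CharT is 16 bit, to UTF-32 when it is 32 bit.
template <typename CharT>
std::basic_string<CharT> utf8_to(const std::string& utf8)
{
    static_assert(sizeof(CharT) == 2 || sizeof(CharT) == 4,
                  "CharT must be 16 or 32 bits wide");
    std::basic_string<CharT> out;
    for (std::size_t i = 0; i < utf8.size();) {
        char32_t cp = decode_utf8(utf8, i);
        if (sizeof(CharT) == 4 || cp < 0x10000) {
            out.push_back(static_cast<CharT>(cp));     // one unit is enough
        } else {
            cp -= 0x10000;                             // split into a surrogate pair
            out.push_back(static_cast<CharT>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<CharT>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The same call, utf8_to<wchar_t>(...), then gives UTF-16 on a platform with a 16-bit wchar_t and UTF-32 on a platform with a 32-bit one, with no Maxcode or mode parameters to get wrong.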

That's it. It isn't <codecvt> because C++11 <codecvt> does not actually do the job needed.

Artyom Beilis
