Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2015-10-08 16:17:40


--------------------------------------------
On Thu, 10/8/15, Peter Dimov <lists_at_[hidden]> wrote:

 Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
 To: boost_at_[hidden]
 Date: Thursday, October 8, 2015, 10:14 PM
 
> I agree that this makes the most sense. I only brought up <codecvt> because
> if we used the standard interface and names we wouldn't have needed a full
> review of the hypothetical libs/codecvt.

See... a lot of the Unicode-related stuff in the standard library is broken.
It wasn't fixed in C++11 and it won't be fixed later.

Also, there is a deep problem with the Windows API, which created its own
Wide API and ignores every standard, both C and C++.
I.e., there are ordinary files that can't even be opened on Windows
using plain C fopen or C++ std::fstream.
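
To make the fopen problem concrete: on Windows the narrow fopen interprets the file name in the process's ANSI code page, so a name containing characters outside that code page cannot be reached through it, while the wide _wfopen can open it. Below is a minimal sketch of the kind of wrapper this implies. The names fopen_utf8 and widen_utf8 are hypothetical (they are not taken from any Boost library), and the sketch assumes file names are passed around as UTF-8.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: UTF-8 -> UTF-16 through the Win32 API.
// Returns an empty string if the input is empty or is not valid UTF-8.
static std::wstring widen_utf8(const std::string& s)
{
    if (s.empty()) return std::wstring();
    const int len = static_cast<int>(s.size());
    // First call only asks how many UTF-16 code units are needed.
    const int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      s.data(), len, nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring w(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        s.data(), len, &w[0], n);
    return w;
}
#endif

// Hypothetical wrapper: open a file whose name is given in UTF-8.
std::FILE* fopen_utf8(const std::string& name, const char* mode)
{
    assert(mode != nullptr);
#ifdef _WIN32
    // Go through the wide CRT entry point so the ANSI code page is never involved.
    const std::wstring wname = widen_utf8(name);
    const std::wstring wmode = widen_utf8(mode);
    if (wname.empty() || wmode.empty()) return nullptr;
    return _wfopen(wname.c_str(), wmode.c_str());
#else
    // Elsewhere the narrow API takes the bytes of the name as they are.
    return std::fopen(name.c_str(), mode);
#endif
}
```

A caller would use fopen_utf8(name, "r") exactly like fopen and close the result with fclose.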

> As this stands, libs/utility seems the best bet, although I'm not overly
> fond of the practice of putting everything that doesn't fit elsewhere into
> Utility. :-) But it's better than Detail because it's documented and tested.
> One could make the case for libs/utf8 which would contain utf8_facet and
> the "obvious"
>
>     bool is_valid_utf8( string const & s );
>     wstring utf8_decode( string const & s );
>     string utf8_encode( wstring const & s );
>
> but this is already well into full review/bikeshed territory.

See, all of this is already implemented in a header-only way in Boost.Locale, so no linking is required.

https://github.com/boostorg/locale/blob/master/include/boost/locale/utf.hpp
https://github.com/boostorg/locale/blob/master/include/boost/locale/encoding_utf.hpp

So just call boost::locale::conv::utf_to_utf<wchar_t>("Hello World");

Full codecvt facets for many encodings, including UTF-8, ISO-8859-* and Windows-125*,
are already there as well.

However, there is one very useful specific codecvt, the one that converts between UTF-8 and wchar_t/char16_t/char32_t,
that can be implemented header-only, without linking against the big and complex Boost.Locale library.
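
To give an idea of how little code the core of such a conversion needs, here is a minimal sketch of strict UTF-8 to UTF-32 decoding in plain standard C++. It is not Boost.Locale's implementation and not a codecvt facet: the function name utf8_to_utf32 is hypothetical, and a real facet would additionally have to handle partial input and the reverse direction.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into UTF-32 code points.
// Returns false on malformed input: stray or missing continuation bytes,
// truncated sequences, overlong forms, surrogates, values above U+10FFFF.
bool utf8_to_utf32(const std::string& in, std::u32string& out)
{
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp = 0;   // code point being assembled
        int trail = 0;     // number of continuation bytes expected
        char32_t min = 0;  // smallest value allowed for this sequence length
        if (lead < 0x80)      { cp = lead; }
        else if (lead < 0xC0) { return false; }  // continuation byte as lead
        else if (lead < 0xE0) { cp = lead & 0x1F; trail = 1; min = 0x80; }
        else if (lead < 0xF0) { cp = lead & 0x0F; trail = 2; min = 0x800; }
        else if (lead < 0xF8) { cp = lead & 0x07; trail = 3; min = 0x10000; }
        else                  { return false; }  // 0xF8..0xFF never appear
        for (; trail > 0; --trail) {
            if (i == in.size()) return false;    // sequence cut short
            const unsigned char c = static_cast<unsigned char>(in[i++]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        out.push_back(cp);
    }
    // Every code point consumed at least one input byte.
    assert(out.size() <= in.size());
    return true;
}
```

Because the decoder keeps no state between code points, the same loop can feed wchar_t, char16_t or char32_t output, which is what makes a header-only facet practical.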

Also, I'm going to make it a little bit more generic, so you can easily implement conversion from wchar_t/char16_t/char32_t to
any stateless encoding (I want to improve some stuff within Boost.Locale as well).

So the utf8 codecvt facet is already an INTEGRAL part of Boost.Locale; it exists there.

I just think I'll make it more accessible to general libraries, without the requirement of linking,
and easier for users to use, without the need for special locale generation.

OK... I've decided what I'm going to do.

The next step is for other libraries to adopt this utf8_codecvt facet.

Artyom Beilis
--------------
CppCMS - C++ Web Framework:   http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/

