Subject: Re: [boost] Review Request: Boost.Locale
From: Gevorg Voskanyan (v_gevorg_at_[hidden])
Date: 2010-05-24 13:57:35


Artyom wrote:
> - There is absolutely no information given about std::mbstate_t, which
> should save intermediate data between conversions, so there is actually
> no way to pass anything between sequential calls of
> std::locale::codecvt<...>::in/out. So even if I observe the first
> surrogate of a pair, there is no way to pass this information to the
> next call, and thus I lose it.

Ah, yes, mbstate_t. It may be good enough for UTF-8 (a multibyte sequence) but may not be usable for UTF-16 (a multi-wchar_t sequence on Windows :-)). Thanks, that fully explains it.
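
To make the problem concrete, here is a minimal sketch of the state that would have to survive between calls. PendingState, flush_pending and put_code_point are hypothetical names invented for illustration; this is not how any standard library or Boost.Locale implements it, only what a conversion needs to remember when the output buffer has room for just one UTF-16 unit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for what std::mbstate_t would have to carry.
struct PendingState {
    char16_t pending_low = 0;  // 0 means "no low surrogate is parked"
};

// Writes a parked low surrogate, if there is one and there is room.
// Returns the number of UTF-16 units written (0 or 1).
std::size_t flush_pending(char16_t* out, std::size_t room, PendingState& st) {
    if (st.pending_low == 0 || room == 0) return 0;
    out[0] = st.pending_low;
    st.pending_low = 0;
    return 1;
}

// Encodes one code point as UTF-16 into out[0..room).
// If only the high surrogate fits, the low one is parked in `st`
// so that the next call can emit it. Returns the units written.
std::size_t put_code_point(char32_t cp, char16_t* out, std::size_t room,
                           PendingState& st) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    assert(st.pending_low == 0);  // caller must flush first
    assert(room >= 1);
    if (cp < 0x10000) {           // BMP: a single unit, no state needed
        out[0] = static_cast<char16_t>(cp);
        return 1;
    }
    const char32_t v = cp - 0x10000;
    const char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));
    const char16_t low = static_cast<char16_t>(0xDC00 + (v & 0x3FF));
    out[0] = high;
    if (room >= 2) {
        out[1] = low;
        return 2;
    }
    st.pending_low = low;         // this is the bit that must outlive the call
    return 1;
}
```

With a one-unit buffer, put_code_point leaves the low surrogate in PendingState and flush_pending emits it on the next call. A codecvt facet has nowhere portable to keep that value, since the contents of std::mbstate_t are unspecified.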

> This is exactly the reason you can't implement UTF-8 to UTF-16
> conversion using a codecvt facet.

And still codecvt<char16_t, char, mbstate_t> converts between UTF-8 and UTF-16 in the C++0x draft. That seems to suggest the new standard will require mbstate_t to be usable for UTF-16 as well.

> On the other hand, there are no such limitations for UTF-32 encodings,
> as there is no information to preserve between calls.
>
> Additional note: it is also not possible to convert stateful encodings
> like UTF-7, as there is no way to move state around.
>
> So generally std::locale::codecvt is not well designed to be derived
> from, so the only way to do stream conversion correctly is to redesign
> this facet, but in that case you can't use it with the std::iostreams library.

Yes, I see.

> >
> > For the original (non-compliance) point I raised it would
> > be interesting to see how well codecvt< char32_t, char,
> > std::mbstate_t > is going to be implemented under windows
> > :)
>
> There is no problem to implement it correctly.

My point is that, if that is implemented correctly, then strictly speaking an implementation where wchar_t is 16 bits wide (sizeof(wchar_t) == 2 with 8-bit bytes) will become non-conforming according to 3.9.1/5, which would be interesting to see :)
As intended by the standard, wchar_t should have at least 21 bits on C++ implementations supporting Unicode, but of course that isn't going to be fixed for Windows compilers in the foreseeable future.
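
As a rough illustration of that width requirement, the following sketch checks whether a single wchar_t can represent every Unicode code point on the implementation at hand. The names kMaxCodePoint, wchar_bits and wchar_holds_any_code_point are made up for this example; only CHAR_BIT and WCHAR_MAX come from the standard headers.

```cpp
#include <climits>
#include <cwchar>

// Largest Unicode code point; representing it takes 21 value bits.
constexpr unsigned long long kMaxCodePoint = 0x10FFFFull;

// Width of wchar_t in bits on this implementation.
constexpr int wchar_bits() {
    return static_cast<int>(sizeof(wchar_t) * CHAR_BIT);
}

// True if one wchar_t can hold any Unicode code point.
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long long>(WCHAR_MAX) >= kMaxCodePoint;
}
```

On a compiler with a 16-bit wchar_t, wchar_holds_any_code_point() would be false, which is exactly the case where a single wchar_t cannot be a whole character and UTF-16 surrogates come into play.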

> >
> > BTW, I see some interesting additions to codecvts in n3090,
> > 22.5.
> > Any plans to implement them in Boost.Locale?
>
> On the same note, once char32_t/char16_t are available, hopefully
> these facets will be implemented. But today it is impossible to
> implement UTF-16 codecvt facets.

You're right, implementing them would require implementation-specific knowledge about std::mbstate_t.

> My personal opinion: avoid wide characters and any "Unicode"
> characters, because they are the best way to fool yourself about
> "Unicode" support; in reality they do not provide any advantage over
> plain char and UTF-8 encoding.
>
> So, unless you are using the Win32 API, avoid wide characters.
> However, too many programmers would disagree with me, especially
> Windows programmers who grew up on the "Unicode" and "Wide" API.
> So Boost.Locale fully supports wide characters.

Despite having started as a Windows programmer myself, I don't disagree with you on this point. On the contrary, I've always been uncomfortable with the Windows A/W API, and would've much preferred UTF-8 instead, as is the case in the *nix world. Another reason I am still forced to use wide characters is wxWidgets, which (in its 2.x releases) assumes ANSI unless wxUSE_UNICODE is defined to a non-zero value, in which case it uses wide characters in its API, essentially following the Windows model. Fortunately, this is going to change in the soon-to-be-released wxWidgets 3.0, which will have a UTF-8 interface.

> >
> > Non-iterator interface is a real pain in using codecvt, I
> > admit.
>
> I think the best interface would be something like a boost::iostreams
> filter, but I think this should be part of the iostreams library rather
> than localization. Also, it should not pass through a wide encoding in
> the middle when converting UTF-8 to ISO-8859-8.
>
> But that is different story.
>
> For simple string conversion, boost::locale provides from_utf/to_utf,
> which work correctly with UTF-8/16/32.

Looking forward to Boost.Locale review!

> Artyom

Artyom, thank you very much for sharing your insights and satisfying my curiosity!

Best Regards,
Gevorg

