Subject: Re: [boost] Unicode and codecvt facets
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2010-07-05 13:01:40


On 05/07/10 17:27, Artyom wrote:
>>
>> As some may know, I am working on a Unicode library that I plan to submit to
>> Boost fairly soon.
>>
>
> Take a look at the Boost.Locale proposal.

I know of it, yes.
But my library purposely *doesn't* use the standard C++ locale subsystem
because it's slow, broken, and inflexible.
Nevertheless, I want to make it possible to bridge my library with
that system.

>
>> The codecs in that library are based around iterators and ranges, but since
>> there was some demand for support
>> for codecvt facets I am working on adapting those into that form as well.
>>
>> Unfortunately, it seems it is only possible to subclass std::codecvt<char,
>> char, mbstate_t> and
>> std::codecvt<wchar_t, char, mbstate_t>.
>
> Yes, these are actually the only specialized classes.

I was hoping I could specialize some more myself.
Some implementations appear to support using arbitrary codecvt facets
just fine, but not GCC's and MSVC's.

> More than that,
> std::codecvt<char, char, mbstate_t>
> should be a "noconvert" facet.

I'm talking about types derived from these.
Nothing requires subclasses of std::codecvt<char, char, mbstate_t> to
be non-converting; only std::codecvt<char, char, mbstate_t> itself is.
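
To illustrate the point, here is a minimal sketch of a facet derived from
std::codecvt<char, char, mbstate_t> that does convert. The names
rot13_codecvt, char_codecvt and encode are made up for this example and
are not part of any library; ROT13 is used only because it is a trivial
one-to-one transformation, and the letter arithmetic assumes an
ASCII-compatible execution character set.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

typedef std::codecvt<char, char, std::mbstate_t> char_codecvt;

// Hypothetical facet: same char/char base class, but it converts (ROT13).
class rot13_codecvt : public char_codecvt {
public:
    explicit rot13_codecvt(std::size_t refs = 0) : char_codecvt(refs) {}

protected:
    // Tell callers that a conversion step is required.
    bool do_always_noconv() const noexcept override { return false; }

    result do_out(std::mbstate_t&, const char* from, const char* from_end,
                  const char*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        return rotate(from, from_end, from_next, to, to_end, to_next);
    }

    result do_in(std::mbstate_t&, const char* from, const char* from_end,
                 const char*& from_next, char* to, char* to_end,
                 char*& to_next) const override {
        return rotate(from, from_end, from_next, to, to_end, to_next);
    }

private:
    // Assumes an ASCII-compatible execution character set.
    static char rot13(char c) {
        if (c >= 'a' && c <= 'z') return static_cast<char>('a' + (c - 'a' + 13) % 26);
        if (c >= 'A' && c <= 'Z') return static_cast<char>('A' + (c - 'A' + 13) % 26);
        return c;
    }

    // Converts one char at a time until the input or output range is exhausted.
    static result rotate(const char* from, const char* from_end,
                         const char*& from_next, char* to, char* to_end,
                         char*& to_next) {
        from_next = from;
        to_next = to;
        while (from_next != from_end && to_next != to_end)
            *to_next++ = rot13(*from_next++);
        return from_next == from_end ? ok : partial;
    }
};

// Runs a facet's out() over a whole string.
std::string encode(const char_codecvt& cvt, const std::string& text) {
    std::mbstate_t state = std::mbstate_t();
    std::string out(text.size(), '\0');
    const char* from_next = nullptr;
    char* to_next = nullptr;
    char_codecvt::result r =
        cvt.out(state, text.data(), text.data() + text.size(), from_next,
                &out[0], &out[0] + out.size(), to_next);
    if (r == char_codecvt::noconv) return text;  // nothing was converted
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Passing the facet obtained from std::locale::classic() to encode leaves
the text unchanged, while passing a rot13_codecvt transforms it. Whether
a given basic_filebuf<char> honours such a facet is up to the
implementation.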

> You can derive from these two classes and re-implement them (like I did in
> Boost.Locale).

That's indeed what I said I can do, but as I said, I find that very limiting.

> Also, I strongly recommend taking a look at locale and iostreams in the
> standard library if you are working with Unicode for C++.

The thing is, I'm not sure it's worth delving into it too much. On top
of being a so-so design, the popular implementations seem to all do
things differently and have different limitations.

>
>>
>> What I wonder is if there is really a point to facets, then.
>> std::codecvt<wchar_t, char, mbstate_t> means that the in-memory charset would
>> be UTF-16 or UTF-32 (depending on the size of wchar_t) while the file would be
>> UTF-8.
>
> Not exactly; the narrow encoding may be any 8-bit encoding, even something
> like Latin1 or Shift-JIS (and UTF-8 as well).

My library doesn't aim at providing code conversion from/to every
character set ever invented, which is why I just put UTF-8 in there.

Regardless, I intend to allow defining a codecvt facet from any pair of
objects modeling the Converter concept, so nothing would prevent someone
from writing one or chaining them to do whatever they want, provided it
converts between char and char or wchar_t and char, since there seems to
be no way around that restriction.

That way you can also do normalization, case conversion or whatnot with
a codecvt facet.
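
As a rough sketch of that idea, the facet below is built from a pair of
per-character functions and used for case conversion. It is not the
library's actual interface: converter_codecvt, char_converter and
make_case_locale are hypothetical names, and a plain std::function
stands in for whatever the Converter concept really requires.

```cpp
#include <cctype>
#include <cstddef>
#include <cwchar>
#include <functional>
#include <locale>

typedef std::codecvt<char, char, std::mbstate_t> char_codecvt;
typedef std::function<char(char)> char_converter;  // stand-in for a Converter

// Hypothetical facet assembled from one converter per direction.
class converter_codecvt : public char_codecvt {
public:
    converter_codecvt(char_converter on_read, char_converter on_write,
                      std::size_t refs = 0)
        : char_codecvt(refs), on_read_(on_read), on_write_(on_write) {}

protected:
    bool do_always_noconv() const noexcept override { return false; }

    // External to internal: applied when reading.
    result do_in(std::mbstate_t&, const char* from, const char* from_end,
                 const char*& from_next, char* to, char* to_end,
                 char*& to_next) const override {
        return apply(on_read_, from, from_end, from_next, to, to_end, to_next);
    }

    // Internal to external: applied when writing.
    result do_out(std::mbstate_t&, const char* from, const char* from_end,
                  const char*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        return apply(on_write_, from, from_end, from_next, to, to_end, to_next);
    }

private:
    static result apply(const char_converter& f, const char* from,
                        const char* from_end, const char*& from_next,
                        char* to, char* to_end, char*& to_next) {
        from_next = from;
        to_next = to;
        while (from_next != from_end && to_next != to_end)
            *to_next++ = f(*from_next++);
        return from_next == from_end ? ok : partial;
    }

    char_converter on_read_;
    char_converter on_write_;
};

char to_upper(char c) {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

char to_lower(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// The new facet takes the slot of std::codecvt<char, char, mbstate_t>
// in the returned locale: lower-case on read, upper-case on write.
std::locale make_case_locale() {
    return std::locale(std::locale::classic(),
                       new converter_codecvt(to_lower, to_upper));
}
```

Because the derived class shares the id of its base,
std::use_facet<char_codecvt>(make_case_locale()) returns this facet, and
a stream buffer imbued with that locale may pick it up. A one-to-one
function keeps the sketch short; normalization would need converters
that can change the length of the text.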

> C++0x provides char16_t and char32_t to fix this standard's bug.

GCC has those types in C++0x mode, but doesn't support codecvt facets
with them.

>
>>
>> Why do people even use utf8_codecvt_facet anyway? What's wrong with dealing
>> with UTF-8 rather than
>> maybe UTF-16 or UTF-32?
>>
>
> Ask Windows developers, they use wide strings because it is the only way to work
> correctly with their OS.

utf8_codecvt_facet is a utility provided by Boost in the detail
namespace that some libraries not particularly tied to Windows appear
to use.

