Subject: [boost] Unicode and codecvt facets
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2010-07-05 12:03:34


As some may know, I am working on a Unicode library that I plan to
submit to Boost fairly soon.

The codecs in that library are built around iterators and ranges, but
since there has been some demand for codecvt facet support, I am working
on adapting the codecs into that form as well.

Unfortunately, it seems the only specializations that can be subclassed
are std::codecvt<char, char, mbstate_t> and std::codecvt<wchar_t, char,
mbstate_t>.
I don't know much about iostreams and locales, but from a quick look at
libstdc++'s implementation, it doesn't seem possible for std::locale to
contain any other specialization of codecvt.
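
For reference, here is a minimal sketch of the case that does work: deriving from std::codecvt<wchar_t, char, mbstate_t> and installing the result in a std::locale. The names demo_codecvt and locale_uses_demo_facet are made up for illustration, and the facet just inherits the default conversion behaviour rather than implementing any real codec.

```cpp
#include <cassert>
#include <cstddef>   // std::size_t
#include <cwchar>    // std::mbstate_t
#include <locale>

// Hypothetical facet: overrides nothing, it only shows that this
// specialization can be subclassed and installed in a locale.
class demo_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
public:
    explicit demo_codecvt(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}
};

// Builds a locale around demo_codecvt and reports whether looking up the
// base specialization hands back that same facet object.
bool locale_uses_demo_facet() {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> base_facet;

    demo_codecvt* facet = new demo_codecvt;          // the locale takes ownership
    std::locale loc(std::locale::classic(), facet);  // replaces the wchar_t/char codecvt

    assert(std::has_facet<base_facet>(loc));
    const base_facet& found = std::use_facet<base_facet>(loc);
    return &found == facet;
}
```

A stream imbued with such a locale would look the facet up through the base specialization, which is why the derived class has to share its id.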

What I wonder, then, is whether there is really a point to facets.
Using std::codecvt<wchar_t, char, mbstate_t> means the in-memory encoding
would be UTF-16 or UTF-32 (depending on the size of wchar_t), while the
file would be UTF-8.
The problem is that wchar_t is platform-dependent and not really
reliable, so I would not recommend it as the in-memory representation
for Unicode text.
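
To make the platform dependence concrete, here is a small sketch. It assumes the compiler encodes wide literals as UTF-16 when wchar_t is 16 bits and as UTF-32 when it is 32 bits, which is the usual behaviour but not guaranteed by the standard. The helper names are invented for illustration.

```cpp
#include <cassert>
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t
#include <string>

// Hypothetical helper: guesses the encoding form implied by the width of wchar_t.
std::string wide_encoding_guess() {
    const std::size_t bits = sizeof(wchar_t) * CHAR_BIT;
    assert(bits >= 8);  // a wchar_t is at least one byte wide
    if (bits >= 32) return "UTF-32";
    if (bits >= 16) return "UTF-16";
    return "unknown";
}

// Number of wchar_t units the compiler emits for U+1D11E, a character
// outside the Basic Multilingual Plane.
std::size_t units_for_non_bmp_char() {
    const std::wstring clef = L"\U0001D11E";
    return clef.size();
}
```

Under that assumption, the same source line yields a string of length 2 (a surrogate pair) where wchar_t is 16 bits and length 1 where it is 32 bits, so code that indexes or counts wchar_t units behaves differently from one platform to the next.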

Why do people use utf8_codecvt_facet at all? What's wrong with working
with UTF-8 directly, rather than UTF-16 or UTF-32?

