Subject: Re: [Boost-users] handling unicode strings / encodings?
From: Maróy Ákos (akos_at_[hidden])
Date: 2009-11-25 02:22:11
Robert,
> The xml_warchive of the serialization library uses this codecvt facet.
>
> It's very easy to use for the purpose you describe.
I looked at the code, and it really seems easy - but I have mixed
results. Most probably this is due to my lack of understanding of how to
use codecvt<>.
I can convert a UCS-2 (wchar_t) string into UTF-8 if I write it to
a wofstream. I can also read UTF-8 and convert it to a UCS-2 (wchar_t)
string using a wifstream. But if I write to a wstringstream instead,
no conversion is performed. Also, I cannot make std::cout perform
the conversion.
What I also don't understand is why the target of the conversion has
to be wchar_t-based (wofstream, wstringstream, etc.); see below. After
all, the point is that I'm converting from a series of wchar_t code
points into a sequence of bytes, since UTF-8 is a transfer format that
encodes characters as sequences of bytes. The same holds the other way
around: when reading UTF-8, I would expect to read from a 'normal'
input stream (based on char, not wchar_t), since one is decoding a
byte-based encoding into wchar_t code points.
I'm doing this on Linux 64 bit using g++ 4.3.2.
The sample string I'm experimenting with is "Helló", that is,
L"Hell\xf3", or 48 65 6c 6c f3 in UCS-2, and 48 65 6c 6c c3 b3 in UTF-8.
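For reference, the UTF-8 byte sequence quoted here can be reproduced by hand. The following is only a minimal sketch, assuming each wchar_t holds one whole code point (which holds where wchar_t is 32 bits wide, as with g++ on Linux); append_utf8 and to_utf8 are hypothetical helper names, not part of Boost or the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point to out.
void append_utf8(std::string& out, unsigned long cp)
{
    assert(cp <= 0x10FFFF);  // precondition: inside the Unicode range
    if (cp < 0x80) {                        // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode a wide string, treating every wchar_t
// as a complete code point (assumption: no surrogate pairs).
std::string to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
        append_utf8(out, static_cast<unsigned long>(in[i]));
    return out;
}
```

For the sample string, the code point 0xf3 falls in the two-byte range: 0xC0 | (0xf3 >> 6) gives c3 and 0x80 | (0xf3 & 0x3F) gives b3, which matches the sequence above.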
The samples I'm using:
// create the utf8_locale
std::locale old_locale;
std::locale utf8_locale(old_locale,
    new boost::archive::detail::utf8_codecvt_facet());
// the sample string
const std::wstring hello = L"Hell\xf3";
// try to print this as UTF-8 to std::cout
std::cout.imbue(utf8_locale);
std::cout << hello << std::endl; // this doesn't compile
// try to print this as UTF-8 to std::wcout
std::wcout.imbue(utf8_locale);
std::wcout << hello << std::endl;
// this gives "Hell?", a '?' character at the non-ASCII location
// convert into a file and back - this seems to work
{
    std::wofstream ofs("hello");
    ofs.imbue(utf8_locale);
    ofs << hello;
}
{
    std::wstring from_file;
    std::wifstream ifs("hello");
    ifs.imbue(utf8_locale);
    ifs >> from_file;
    BOOST_FOREACH(wchar_t ch, from_file) {
        std::wcout << std::hex << (unsigned int) ch << L" ";
    }
    std::wcout << std::endl;
}
// this seems OK: the file named "hello" will contain the
// proper UTF-8 sequence, 48 65 6c 6c c3 b3, and the output is the
// expected UCS-2 sequence, 48 65 6c 6c f3
// try to convert to a stringstream and back
std::stringstream sstr;
sstr.imbue(utf8_locale);
sstr << hello; // this doesn't compile
// OK, try with a wide string stream
std::wstringstream sstr;
sstr.imbue(utf8_locale);
sstr << hello;
std::wcout << "The UTF-8 from: ";
BOOST_FOREACH(wchar_t ch, sstr.str()) {
    std::wcout << std::hex << ((unsigned int) ch) << L" ";
}
std::wcout << std::endl;
// this gives the UCS-2 sequence, 48 65 6c 6c f3, not the
// expected UTF-8 sequence
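One possible way to get the byte sequence in memory, without going through a file stream, is to call the locale's codecvt facet directly. In the standard library the file stream buffers apply the imbued codecvt facet, while the string stream buffers do not, which would be consistent with the wstringstream result described here. The sketch below assumes the facet installed in the locale derives from std::codecvt<wchar_t, char, std::mbstate_t>, as the Boost utf8_codecvt_facet does; wide_codecvt, to_bytes and from_bytes are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>      // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

typedef std::codecvt<wchar_t, char, std::mbstate_t> wide_codecvt;

// Hypothetical helper: convert a wide string to the external byte
// encoding defined by the codecvt facet of loc.
std::string to_bytes(const std::wstring& in, const std::locale& loc)
{
    if (in.empty()) return std::string();
    const wide_codecvt& cvt = std::use_facet<wide_codecvt>(loc);
    std::mbstate_t state = std::mbstate_t();
    // max_length() bounds the number of bytes one wchar_t can produce.
    std::vector<char> buf(in.size() * static_cast<std::size_t>(cvt.max_length()));
    const wchar_t* from_next = 0;
    char* to_next = 0;
    const std::codecvt_base::result r = cvt.out(
        state, in.data(), in.data() + in.size(), from_next,
        &buf[0], &buf[0] + buf.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("wide-to-byte conversion failed");
    assert(from_next == in.data() + in.size());  // all input consumed
    return std::string(&buf[0], to_next);
}

// Hypothetical helper: the reverse direction, bytes to wide string.
// Assumption: the encoding never yields more than one wchar_t per byte,
// which is true for UTF-8.
std::wstring from_bytes(const std::string& in, const std::locale& loc)
{
    if (in.empty()) return std::wstring();
    const wide_codecvt& cvt = std::use_facet<wide_codecvt>(loc);
    std::mbstate_t state = std::mbstate_t();
    std::vector<wchar_t> buf(in.size());
    const char* from_next = 0;
    wchar_t* to_next = 0;
    const std::codecvt_base::result r = cvt.in(
        state, in.data(), in.data() + in.size(), from_next,
        &buf[0], &buf[0] + buf.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("byte-to-wide conversion failed");
    assert(from_next == in.data() + in.size());  // all input consumed
    return std::wstring(&buf[0], to_next);
}
```

With the utf8_locale from the samples, to_bytes(hello, utf8_locale) should in principle yield a std::string holding the UTF-8 bytes, which can then be written to std::cout or a plain std::stringstream. With the default "C" locale only ASCII input is likely to convert successfully.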
Akos
Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net