Boost Users :

Subject: Re: [Boost-users] handling unicode strings / encodings?
From: Maróy Ákos (akos_at_[hidden])
Date: 2009-11-25 02:22:11


Robert,

> The xml_warchive of the serialization library uses this codecvt facet.
>
> It's very easy to use for the purpose you describe.

I looked at the code, and it really does seem easy, but I get mixed
results. Most probably this is due to my lack of understanding of how to
use codecvt<>.

I can convert a UCS-2 (wchar_t) string into UTF-8 if I write it to
a wofstream. I can also read UTF-8 and convert it to a UCS-2 (wchar_t)
string using a wifstream. But if I try to convert through a wstringstream,
no conversion is performed. Also, I cannot make std::cout perform
the conversion.

What I also don't understand is why the target of the conversion has to
be wchar_t-based (wofstream, wstringstream, etc.; see below). After all,
the point is that I'm converting from a series of wchar_t code points
into a sequence of bytes, as UTF-8 is a transfer format that encodes
each character as a sequence of bytes. The same holds the other way
around: reading UTF-8 would be done from a 'normal' input stream (based
on char, not wchar_t), as one is decoding a byte-based encoding into
wchar_t code points.
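
To illustrate the point about code points versus bytes, here is a minimal
sketch of the wchar_t-to-UTF-8 mapping written out by hand. It assumes each
wchar_t holds one whole code point (as with g++ on Linux) and does no
validation. The names append_utf8 and to_utf8 are hypothetical helpers, not
part of Boost or the standard library.

```cpp
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point to `out`.
// No validation is done (surrogates and values above 0x10FFFF are not rejected).
void append_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode a whole wide string into a byte string.
// Assumes one wchar_t per code point (no UTF-16 surrogate pairs).
std::string to_utf8(const std::wstring& in) {
    std::string out;
    for (std::wstring::size_type i = 0; i < in.size(); ++i) {
        append_utf8(out, static_cast<unsigned long>(in[i]));
    }
    return out;
}
```

Note that the result is a plain std::string: the input side is wchar_t,
the output side is char. For the string L"Hell\xf3", to_utf8 yields the six
bytes 48 65 6c 6c c3 b3.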

I'm doing this on 64-bit Linux using g++ 4.3.2.

The sample string I'm experimenting with is "Helló", that is,
L"Hell\xf3", or 48 65 6c 6c f3 in UCS-2, and 48 65 6c 6c c3 b3 in UTF-8.

The samples I'm using:

// create the utf8_locale
std::locale old_locale;
std::locale utf8_locale(old_locale,
            new boost::archive::detail::utf8_codecvt_facet());

// the sample string
const std::wstring hello = L"Hell\xf3";

// try to print this as UTF-8 to std::cout
std::cout.imbue(utf8_locale);
std::cout << hello << std::endl; // this doesn't compile

// try to print this as UTF-8 to std::wcout
std::wcout.imbue(utf8_locale);
std::wcout << hello << std::endl;
// this gives "Hell?", a '?' character at the non-ASCII location

// convert into a file and back - this seems to work
{
    std::wofstream ofs("hello");
    ofs.imbue(utf8_locale);
    ofs << hello;
}
{
    std::wstring from_file;
    std::wifstream ifs("hello");
    ifs.imbue(utf8_locale);
    ifs >> from_file;

    BOOST_FOREACH(wchar_t ch, from_file) {
        std::wcout << std::hex << (unsigned int) ch << L" ";
    }
    std::wcout << std::endl;
}
// this seems OK: the file named "hello" contains the
// proper UTF-8 sequence, 48 65 6c 6c c3 b3, and the output is
// the expected UCS-2 sequence, 48 65 6c 6c f3

// try to convert to a stringstream and back
std::stringstream sstr;
sstr.imbue(utf8_locale);
sstr << hello; // this doesn't compile

// OK, try with a wide string stream
// (named wsstr so it does not clash with the narrow sstr above)
std::wstringstream wsstr;
wsstr.imbue(utf8_locale);
wsstr << hello;

std::wcout << L"The UTF-8 form: ";
BOOST_FOREACH(wchar_t ch, wsstr.str()) {
    std::wcout << std::hex << ((unsigned int) ch) << L" ";
}
std::wcout << std::endl;
// this gives the UCS-2 sequence, 48 65 6c 6c f3, not the
// expected UTF-8 sequence
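
One possible explanation for the wstringstream result, offered as a sketch
rather than a definitive answer: the standard describes code conversion
through codecvt only for file stream buffers (basic_filebuf), while
basic_stringbuf simply stores the characters it is given, so imbuing the
locale does not change the stored string. The facet can still be driven by
hand through std::use_facet. The block below assumes the facet is installed
under the codecvt<wchar_t, char, mbstate_t> id, as any facet derived from it
is. The names encode_with_locale and make_utf8_locale are hypothetical, and
std::codecvt_utf8 (C++11, deprecated since C++17) stands in for
boost::archive::detail::utf8_codecvt_facet only to keep the example free of
Boost headers.

```cpp
#include <codecvt>    // std::codecvt_utf8 (C++11; deprecated since C++17)
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

typedef std::codecvt<wchar_t, char, std::mbstate_t> wide_codecvt;

// Hypothetical helper: run the codecvt facet found in `loc` by hand,
// turning wide characters into the facet's external byte encoding.
std::string encode_with_locale(const std::wstring& in, const std::locale& loc) {
    const wide_codecvt& cvt = std::use_facet<wide_codecvt>(loc);
    std::mbstate_t state = std::mbstate_t();

    // max_length() bounds the number of bytes one wchar_t can expand to.
    std::vector<char> buf(in.size() * cvt.max_length() + 1);

    const wchar_t* from_next = nullptr;
    char* to_next = nullptr;
    std::codecvt_base::result r = cvt.out(
        state,
        in.data(), in.data() + in.size(), from_next,
        &buf[0], &buf[0] + buf.size(), to_next);

    if (r != std::codecvt_base::ok) {
        throw std::runtime_error("wide-to-byte conversion failed");
    }
    // Everything between the buffer start and to_next is converted output.
    return std::string(&buf[0], to_next);
}

// Stand-in for the Boost-based utf8_locale in the samples above
// (assumption: the Boost facet could be installed the same way).
std::locale make_utf8_locale() {
    return std::locale(std::locale::classic(),
                       new std::codecvt_utf8<wchar_t>());
}
```

Called as encode_with_locale(hello, make_utf8_locale()), this returns a
std::string of bytes, so the char-based result asked about above is
available without going through a file.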

Akos

