From: Reece Dunn (msclrhd_at_[hidden])
Date: 2005-09-19 02:36:20


Vladimir Prus wrote:
> Adam Badura wrote:
>
>> I looked at a few C++ GUI libraries, but none of them satisfied me,
>> most commonly for these reasons:
>> 1) weak support, if any, for exceptions (wxWidgets for example)
>> 2) using their own classes instead of the standard ones (most common
>> for strings) (wxWidgets and MFC for example)
>
> Well, given that std::basic_string's support for Unicode is lacking, I would
> consider using a custom string class an advantage of those libraries. Hey,
> you can't even convert std::string to std::wstring, and you can't construct
> std::wstring from char*. Can you convert std::wstring to any of Unicode's
> normalization forms? How portable is reading of std::wstring from a file
> with specific 8-bit encoding?

With regard to the MFC/ATL/WTL libraries, they use the Win32 API calls
WideCharToMultiByte (WC2MB) and MultiByteToWideChar (MB2WC) to do the
conversion using the current thread's codepage. Their CA2W/CW2A helper
classes do the same thing.

The WC2MB/MB2WC APIs allow you to pass in a specific codepage (not just
the current thread's or user's). Some of the codepage identifiers are:
   UTF7 = 65000
   UTF8 = 65001
   UTF16 (Little Endian) = 1200
   UTF16 (Big Endian) = 1201

So, you could say something like:

std::cout << unicode_cast< std::string >( russian_text, unicode::utf8 );

where, on Windows, unicode_cast uses WC2MB/MB2WC and unicode::utf8 is
the UTF8 codepage (65001).
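
To make the wide-to-UTF8 step concrete without depending on the Win32
API, here is a minimal portable sketch that encodes a std::wstring as
UTF-8 by hand. The name to_utf8 is hypothetical and is not part of any
existing library; on Windows, WideCharToMultiByte with codepage 65001
would be expected to do the equivalent job. The sketch assumes wchar_t
holds either UTF-16 code units or UTF-32 code points and does not
validate unpaired surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): encode a wide string
// as UTF-8. Handles both 16-bit wchar_t (UTF-16 with surrogate pairs)
// and 32-bit wchar_t (one code point per element).
inline std::string to_utf8( const std::wstring & text )
{
    std::string out;
    for( std::size_t i = 0; i < text.size(); ++i )
    {
        unsigned long cp = static_cast< unsigned long >( text[ i ] );

        // Combine a high/low surrogate pair into a single code point.
        if( cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size())
        {
            const unsigned long lo = static_cast< unsigned long >( text[ i + 1 ] );
            if( lo >= 0xDC00 && lo <= 0xDFFF )
            {
                cp = 0x10000 + (( cp - 0xD800 ) << 10 ) + ( lo - 0xDC00 );
                ++i;
            }
        }

        if( cp < 0x80 )
        {
            out += static_cast< char >( cp );
        }
        else if( cp < 0x800 )
        {
            out += static_cast< char >( 0xC0 | ( cp >> 6 ));
            out += static_cast< char >( 0x80 | ( cp & 0x3F ));
        }
        else if( cp < 0x10000 )
        {
            out += static_cast< char >( 0xE0 | ( cp >> 12 ));
            out += static_cast< char >( 0x80 | (( cp >> 6 ) & 0x3F ));
            out += static_cast< char >( 0x80 | ( cp & 0x3F ));
        }
        else
        {
            out += static_cast< char >( 0xF0 | ( cp >> 18 ));
            out += static_cast< char >( 0x80 | (( cp >> 12 ) & 0x3F ));
            out += static_cast< char >( 0x80 | (( cp >> 6 ) & 0x3F ));
            out += static_cast< char >( 0x80 | ( cp & 0x3F ));
        }
    }
    return out;
}
```

A unicode_cast< std::string >( text, unicode::utf8 ) overload could
forward to something like this on platforms that lack WC2MB.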

Going the other way, reading std::wstring from a file... you can detect
UTF8/16/32 (LE and BE) by checking for a Byte Order Mark (BOM) at the
start of the file (defined at www.unicode.org). This is what is done on
Windows. The leading bytes then map to a format like this:

    0xFE 0xFF -- unicode::utf16be;
    0xFF 0xFE -- unicode::utf16le;
    0xEF 0xBB 0xBF -- unicode::utf8;
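
The byte patterns above can be checked with a few comparisons. The
following is only a sketch: detect_bom and unicode::unknown are
hypothetical names introduced for illustration, and the enum is a
stand-in for whatever unicode::utf8 and friends would really be. It
ignores UTF-32, whose little-endian BOM (FF FE 00 00) begins with the
same two bytes as the UTF-16 LE one.

```cpp
#include <cstddef>
#include <string>

namespace unicode
{
    // Stand-in for the format tags used in the post; `unknown` is an
    // addition for the "no BOM found" case.
    enum format { unknown, utf8, utf16le, utf16be };
}

// Hypothetical helper: look at the first bytes of a raw buffer and
// report which BOM, if any, is present. bom_length receives the number
// of bytes the caller should skip before decoding.
inline unicode::format detect_bom( const std::string & bytes, std::size_t & bom_length )
{
    const std::size_t n = bytes.size();
    const unsigned char b0 = n > 0 ? static_cast< unsigned char >( bytes[ 0 ] ) : 0;
    const unsigned char b1 = n > 1 ? static_cast< unsigned char >( bytes[ 1 ] ) : 0;
    const unsigned char b2 = n > 2 ? static_cast< unsigned char >( bytes[ 2 ] ) : 0;

    if( n >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF )
    {
        bom_length = 3;
        return unicode::utf8;
    }
    if( n >= 2 && b0 == 0xFE && b1 == 0xFF )
    {
        bom_length = 2;
        return unicode::utf16be;
    }
    if( n >= 2 && b0 == 0xFF && b1 == 0xFE )
    {
        bom_length = 2;
        return unicode::utf16le;
    }

    bom_length = 0;
    return unicode::unknown; // caller falls back to a default encoding
}
```

A stream wrapper could call this once on the first few bytes it reads
and remember the result for later conversions.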

Then you could have something like:

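std::basic_istream< char > & operator>>( std::basic_istream< char > & is, std::wstring & str )
{
    std::string s;
    is >> s; // read in a string in its native (raw) form
    // unicode_format() is not a member of the standard streams; it is
    // the proposed hook that reports the detected encoding
    str = unicode_cast< std::wstring >( s, is.unicode_format());
    return is;
}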

where *stream::unicode_format() returns the identified Unicode format, or
some implementation-specific default value if none was detected.

I am not sure about how this would work for Linux, Mac and other
operating systems, though.

- Reece

