From: Reece Dunn (msclrhd_at_[hidden])
Date: 2005-09-19 02:36:20
Vladimir Prus wrote:
> Adam Badura wrote:
>
>> I looked at a few GUI C++ libraries, but none of them satisfied me,
>> most commonly for these reasons:
>>1) weak support, if any, for exceptions (wxWidgets for example)
>>2) using their own classes instead of the standard ones (most common
>>for strings) (wxWidgets and MFC for example)
>
> Well, given that std::basic_string's support for Unicode is lacking, I would
> consider using custom string class an advantage of those libraries. Hey,
> you can't even convert std::string to std::wstring, and you can't construct
> std::wstring from char*. Can you convert std::wstring to any of Unicode's
> normalization forms? How portable is reading of std::wstring from a file
> with specific 8-bit encoding?
WRT the MFC/ATL/WTL libraries, they use the Win32 API calls
WideCharToMultiByte (WC2MB) and MultiByteToWideChar (MB2WC) to do the
conversion using the current thread's codepage. Likewise, their
CA2W/CW2A helper classes do a similar thing.
The WC2MB/MB2WC APIs allow you to pass in a specific codepage (not just
the current thread/user's). These include:
UTF7 = 65000
UTF8 = 65001
UTF16 (Little Endian) = 1200
UTF16 (Big Endian) = 1201
So, you could say something like:
std::cout << unicode_cast< std::string >( russian_text, unicode::utf8 );
where, on Windows, unicode_cast uses WC2MB/MB2WC and unicode::utf8 is
the UTF8 codepage (65001).
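To make the wide-to-UTF8 direction concrete without depending on the Win32 calls, here is a rough portable sketch. It is not the WC2MB-based implementation described above: wide_to_utf8 is a hypothetical name, it encodes by hand, and it does no validation of the input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Boost or Win32 API): encode a wide string as UTF-8.
// Works with 32-bit wchar_t (one code point per element) and with 16-bit
// wchar_t, where UTF-16 surrogate pairs are combined first.
// Assumption: input is well formed; lone surrogates are not rejected.
std::string wide_to_utf8( const std::wstring & in )
{
    std::string out;
    for ( std::size_t i = 0; i < in.size(); ++i )
    {
        std::uint32_t cp = static_cast< std::uint32_t >( in[ i ] );

        // Combine a high/low surrogate pair into a single code point.
        if ( cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() )
        {
            std::uint32_t lo = static_cast< std::uint32_t >( in[ i + 1 ] );
            if ( lo >= 0xDC00 && lo <= 0xDFFF )
            {
                cp = 0x10000 + (( cp - 0xD800 ) << 10 ) + ( lo - 0xDC00 );
                ++i;
            }
        }

        if ( cp < 0x80 )
        {
            out += static_cast< char >( cp );
        }
        else if ( cp < 0x800 )
        {
            out += static_cast< char >( 0xC0 | ( cp >> 6 ));
            out += static_cast< char >( 0x80 | ( cp & 0x3F ));
        }
        else if ( cp < 0x10000 )
        {
            out += static_cast< char >( 0xE0 | ( cp >> 12 ));
            out += static_cast< char >( 0x80 | (( cp >> 6 ) & 0x3F ));
            out += static_cast< char >( 0x80 | ( cp & 0x3F ));
        }
        else
        {
            out += static_cast< char >( 0xF0 | ( cp >> 18 ));
            out += static_cast< char >( 0x80 | (( cp >> 12 ) & 0x3F ));
            out += static_cast< char >( 0x80 | (( cp >> 6 ) & 0x3F ));
            out += static_cast< char >( 0x80 | ( cp & 0x3F ));
        }
    }
    return out;
}
```

A unicode_cast< std::string >( text, unicode::utf8 ) could forward to something like this on platforms without WC2MB. The other codepages would still need a real conversion library.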
Going the other way, reading std::wstring from a file... you can detect
UTF8/16/32 (LE and BE) by checking for a Byte Order Mark (BOM) at the
start of the file (defined at www.unicode.org). This is what is done in
Windows. The leading bytes then identify the encoding:
0xFE 0xFF -- unicode::utf16be;
0xFF 0xFE -- unicode::utf16le;
0xEF 0xBB 0xBF -- unicode::utf8;
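The byte patterns above can be checked with a few comparisons. This is only a sketch: bom_format and detect_bom are hypothetical names standing in for the unicode::utf8 style constants, and UTF-32 is left out (a UTF-32LE BOM, FF FE 00 00, would be reported as UTF-16LE here).

```cpp
#include <cstddef>
#include <string>

// Hypothetical names, standing in for unicode::utf8 and friends.
enum bom_format { bom_none, bom_utf8, bom_utf16le, bom_utf16be };

// Inspect the first bytes of a buffer and report which BOM, if any, is present.
bom_format detect_bom( const std::string & bytes )
{
    const unsigned char * p =
        reinterpret_cast< const unsigned char * >( bytes.data());
    const std::size_t n = bytes.size();

    // Check the three-byte UTF-8 mark before the two-byte UTF-16 marks.
    if ( n >= 3 && p[ 0 ] == 0xEF && p[ 1 ] == 0xBB && p[ 2 ] == 0xBF )
        return bom_utf8;
    if ( n >= 2 && p[ 0 ] == 0xFE && p[ 1 ] == 0xFF )
        return bom_utf16be;
    if ( n >= 2 && p[ 0 ] == 0xFF && p[ 1 ] == 0xFE )
        return bom_utf16le;

    return bom_none; // no BOM: caller falls back to some default encoding
}
```

A stream could run a check like this on the first few bytes it reads and remember the result for later conversions.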
Then you could have something like:
std::basic_istream< char > &
operator>>( std::basic_istream< char > & is, std::wstring & str )
{
    std::string s;
    is >> s; // read in a string in its native (raw) form
    // unicode_format() is a proposed stream member, not part of std::istream
    str = unicode_cast< std::wstring >( s, is.unicode_format());
    return is;
}
where *stream::unicode_format() returns the identified unicode form, or
some implementation-specific default value.
I am not sure about how this would work for Linux, Mac and other
operating systems, though.
- Reece