

Subject: Re: [boost] [General] Always treat std::strings as UTF-8? (was [Process] List of small issues)
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-14 03:48:43


> Most platforms have a notion of a 'default' encoding. On Linux, this is
> usually UTF-8 but isn't guaranteed to be. On Windows this is the active
> local codepage (i.e. *not* UTF-8) for char and UCS2 for wchar_t.
>
> The safest approach (and the one taken by the STL and boost) is to assume
> the strings are in this OS's default encoding unless explicitly known to be
> otherwise.

Two problems with this approach:

- Even if the encoding under POSIX platforms is not UTF-8, you will still be
  able to open files, close them, stat them and do any other operation
  regardless of encoding, because the POSIX API is encoding-agnostic. This is
  why it works well.

- Under Windows, on the other hand, you CAN NOT do everything with narrow
  strings. For example, you can't create the file "שלום-سلام-pease-Мир.txt"
  using the char * API. And this has very bad consequences.

> This means you can pass these strings around freely without
> worrying about their encoding because, eventually, they get passed to an OS
> call which knows how to handle them.

You can't under Windows... The "ANSI" API is limited.

> Alternatively, if you need to manipulate the string you can use the OS's
> character conversion functions to take your default-encoding string,
> convert it to something specific, manipulate the result and then convert it
> back. On Windows you would use MultiByteToWideChar/WideCharToMultiByte
> with the CP_ACP flag.

CP_ACP can never be 65001 (UTF-8), so basically you are stuck with the same
problem.

> HTH
>
> Alex

See my mail with the wider description.

Artyom
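
To illustrate the point about narrow strings, here is a minimal sketch of one way to keep UTF-8 in std::string and still reach the wide API on Windows. The names utf8_to_utf16 and open_utf8 are hypothetical helpers made up for this example, not part of Boost or the standard library, and the decoder is deliberately simple (it rejects truncated input and bad continuation bytes but does not check for overlong forms). On POSIX the bytes are passed through untouched, which is the encoding-agnostic behaviour described above.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 into UTF-16 code units.
// Sketch only: overlong encodings and encoded surrogates are not rejected.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Fold in the 6 payload bits of each continuation byte.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical helper: open a file whose name is given as UTF-8.
std::FILE* open_utf8(const std::string& name, const char* mode)
{
#ifdef _WIN32
    // The char* API goes through the active code page, so convert and
    // call the wide variant instead (wchar_t is 16 bits on Windows).
    std::u16string n = utf8_to_utf16(name);
    std::u16string m = utf8_to_utf16(mode);
    std::wstring wname(n.begin(), n.end());
    std::wstring wmode(m.begin(), m.end());
    return _wfopen(wname.c_str(), wmode.c_str());
#else
    // POSIX treats the name as opaque bytes; no conversion is needed.
    return std::fopen(name.c_str(), mode);
#endif
}
```

With something like this, a call such as open_utf8("שלום-سلام-pease-Мир.txt", "w") would take the wide path on Windows, assuming the source file itself is saved as UTF-8.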


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk