Subject: Re: [boost] [General] Always treat std::strings as UTF-8? (was [Process] List of small issues)
From: Alexander Lamaison (awl03_at_[hidden])
Date: 2011-01-13 15:30:43


On Thu, 13 Jan 2011 12:17:05 -0500, Chad Nelson wrote:

> On Thu, 13 Jan 2011 06:35:53 -0800 (PST)
> Artyom <artyomtnk_at_[hidden]> wrote:
>
> [...]
>> Notes:
>>
>> 1. You can also always assume that strings under windows are UTF-8
>> and always convert them to wide string before system calls.
>>
>> This is I think better approach, but it is different from what
>> most of boost does.
> [...]
>
> An interesting thought... I developed a set of ASCII/UTF-8/16/32
> classes for my company not too long ago, and I became fairly familiar
> with the UTF-8 encoding scheme. There was only one issue that stopped
> me from assuming that all std::string types are UTF-8-encoded: what if
> the string *isn't* meant as UTF-8 encoded, and contains characters with
> the high-bit set?
>
> There's nothing technically stopping that from happening, and there's
> no way to determine with complete certainty whether even a string that
> seems to be valid UTF-8 was intended that way, or whether the UTF-8-like
> characters are really meant as their high-ASCII values.
>
> Maybe you know something I don't, that would allow me to change it? I
> hope so, it would simplify some of the code greatly.

Most platforms have a notion of a 'default' encoding. On Linux, this is
usually UTF-8 but isn't guaranteed to be. On Windows, it is the active
ANSI code page (i.e. *not* UTF-8) for char and UTF-16 (originally UCS-2)
for wchar_t.
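
For illustration, here is a minimal sketch of how a program might ask the
platform what that default is. The helper name default_narrow_encoding is
made up for this example; GetACP and nl_langinfo(CODESET) are the real
platform calls. Note that the POSIX branch assumes it is acceptable to
switch LC_CTYPE to the user's environment locale as a side effect.

```cpp
#include <clocale>
#include <string>
#if defined(_WIN32)
#include <windows.h>
#else
#include <langinfo.h>
#endif

// Hypothetical helper: returns a name for the encoding the platform
// uses by default for narrow (char) strings.
std::string default_narrow_encoding() {
#if defined(_WIN32)
    // Active ANSI code page, e.g. 1252 on a Western European system.
    return "CP" + std::to_string(GetACP());
#else
    // Assumption: adopting the environment's LC_CTYPE here is acceptable.
    // If this fails, the current locale is left unchanged.
    std::setlocale(LC_CTYPE, "");
    // Codeset name of the current locale, e.g. "UTF-8" on most Linux setups.
    return nl_langinfo(CODESET);
#endif
}
```

The exact string returned depends on the system and the user's
environment, so it is best treated as informational rather than compared
against fixed names.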

The safest approach (and the one taken by the STL and Boost) is to assume
that strings are in the OS's default encoding unless explicitly known to be
otherwise. This means you can pass these strings around freely without
worrying about their encoding because, eventually, they get passed to an OS
call which knows how to handle them.

Alternatively, if you need to manipulate the string, you can use the OS's
character conversion functions to take your default-encoding string,
convert it to something specific, manipulate the result and then convert it
back. On Windows you would use MultiByteToWideChar/WideCharToMultiByte
with CP_ACP as the code page argument.
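
As a rough, portable sketch of that convert/manipulate/convert-back round
trip, the following uses the standard C functions std::mbstowcs and
std::wcstombs instead of the Windows calls. The helper names widen, narrow,
init_default_locale and to_upper_default_encoding are made up for this
example. It assumes the program calls init_default_locale() once at startup
so that the C locale matches the user's default encoding, and that the
strings contain no embedded NUL characters.

```cpp
#include <clocale>
#include <cstdlib>
#include <cwctype>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: adopt the user's default locale (and encoding).
bool init_default_locale() {
    return std::setlocale(LC_ALL, "") != nullptr;
}

// Hypothetical helper: decode a default-encoding string to wide characters.
std::wstring widen(const std::string& s) {
    // A multibyte string never yields more wide chars than it has bytes.
    std::vector<wchar_t> buf(s.size() + 1);
    std::size_t n = std::mbstowcs(buf.data(), s.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    return std::wstring(buf.data(), n);
}

// Hypothetical helper: encode wide characters back to the default encoding.
std::string narrow(const std::wstring& w) {
    // Each wide char needs at most MB_CUR_MAX bytes in the current locale.
    std::vector<char> buf(w.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(buf.data(), w.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("character not representable");
    return std::string(buf.data(), n);
}

// Example manipulation: upper-case the text one character at a time,
// which is only safe once the string has been decoded.
std::string to_upper_default_encoding(const std::string& s) {
    std::wstring w = widen(s);
    for (wchar_t& c : w)
        c = static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(c)));
    return narrow(w);
}
```

On Windows the same shape applies with MultiByteToWideChar and
WideCharToMultiByte in place of the two standard calls; the point is only
that the manipulation happens on the decoded form, never on the raw bytes.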

HTH

Alex

-- 
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk