Subject: Re: [boost] [General] Always treat std::strings as UTF-8? (was [Process] List of small issues)
From: Alexander Lamaison (awl03_at_[hidden])
Date: 2011-01-14 08:47:13


> On Fri, 14 Jan 2011 00:48:43 -0800 (PST), Artyom wrote:
>
> >>
> > Most platforms have a notion of a 'default' encoding. On Linux, this is
> > usually UTF-8 but isn't guaranteed to be. On Windows this is the active
> > local codepage (i.e. *not* UTF-8) for char and UCS2 for wchar_t.
> >
> > The safest approach (and the one taken by the STL and boost) is to assume
> > the strings are in this OS's default encoding unless explicitly known to be
> > otherwise.
>
> Two problems with this approach:
>
> - Even if the encoding under POSIX platforms is not UTF-8 you will
> still be able to open files, close them, stat them and do any
> other operations regardless of encoding, as the POSIX API is encoding
> agnostic; this is why it works well.

This isn't a problem, right? This is exactly why it _does_ work :D Assume
the strings are in OS-default encoding, don't mess with them, hand them to
the OS API which knows how to treat them.
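
As a rough illustration of "don't mess with them" (a minimal sketch, not code
from this thread; write_file and read_file are hypothetical helper names), the
file name below is handed to fopen as an opaque byte sequence and is never
decoded or re-encoded:

```cpp
#include <cstdio>
#include <string>

// Writes `contents` to the file called `name`. The name is passed to the
// C library exactly as received; no encoding conversion is attempted.
bool write_file(const std::string& name, const std::string& contents) {
    std::FILE* f = std::fopen(name.c_str(), "wb");
    if (!f) return false;
    const bool wrote_all =
        std::fwrite(contents.data(), 1, contents.size(), f) == contents.size();
    const bool closed = std::fclose(f) == 0;
    return wrote_all && closed;
}

// Reads the whole file called `name`, again treating the name as raw bytes.
// Returns an empty string if the file cannot be opened.
std::string read_file(const std::string& name) {
    std::string contents;
    std::FILE* f = std::fopen(name.c_str(), "rb");
    if (!f) return contents;
    char buffer[4096];
    std::size_t n;
    while ((n = std::fread(buffer, 1, sizeof buffer, f)) > 0) {
        contents.append(buffer, n);
    }
    std::fclose(f);
    return contents;
}
```

On POSIX the name reaches the kernel byte for byte, so this should work
whatever the locale encoding is. On Windows the same code goes through the
narrow "ANSI" API by default, with the limits you describe below.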

> - Under Windows, on the other hand, you CANNOT do everything with narrow
> strings. For example you can't create the file "שלום-سلام-pease-Мир.txt"
> using the char * API. And this has very bad consequences.

This is indeed true. I was just describing the situation where the string
came from the result of one call and was being passed around. If you want
to manipulate the strings, things become more tricky.

> > This means you can pass these strings around freely without
> > worrying about their encoding because, eventually, they get passed to an OS
> > call which knows how to handle them.
>
> You can't under Windows... "ANSI" API is limited.

You've missed where I said "pass these strings around". I'm not suggesting
you can change them. But you can take a narrow string returned by an OS
call and pass it to another OS call without any problems.

> > Alternatively, if you need to manipulate the string you can use the OS's
> > character conversion functions to take your default-encoding string,
> > convert it to something specific, manipulate the result and then convert it
> > back. On Windows you would use MultiByteToWideChar/WideCharToMultiByte
> > with the CP_ACP code page identifier.

I omitted one important caveat here: if you manipulate the string once
you've converted it to UTF-16, you may not be able to convert it back to
the default encoding losslessly. For example, as in your string above, if
you take the original string in Arabic, up-convert it and append a Russian
word, you can't blindly convert this back, as the default encoding may not
be able to represent these two character sets simultaneously.
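
To make the caveat concrete, here is a portable sketch. It does not call the
Windows API: to_cp1251_subset is a hypothetical stand-in that assumes the
default code page is Windows-1251 and handles only ASCII and the basic
Cyrillic letters. Like WideCharToMultiByte with a default character, it
substitutes '?' for anything the code page cannot hold and reports that it
did so:

```cpp
#include <string>

// Result of a narrowing conversion: the bytes produced, and whether any
// character had to be replaced because the code page cannot represent it.
struct NarrowResult {
    std::string bytes;
    bool lossy;
};

// Hypothetical helper, for illustration only. Encodes `text` into a subset
// of Windows-1251: ASCII is copied through, and the Cyrillic letters
// U+0410..U+044F map to the bytes 0xC0..0xFF. Every other character becomes
// '?' and sets the `lossy` flag.
NarrowResult to_cp1251_subset(const std::u32string& text) {
    NarrowResult result{std::string(), false};
    for (char32_t c : text) {
        if (c < 0x80) {
            result.bytes.push_back(static_cast<char>(c));
        } else if (c >= 0x0410 && c <= 0x044F) {
            result.bytes.push_back(static_cast<char>(0xC0 + (c - 0x0410)));
        } else {
            result.bytes.push_back('?');  // not representable in this code page
            result.lossy = true;
        }
    }
    return result;
}
```

With the Russian word alone the conversion is exact. With the Arabic and
Russian text together, lossy comes back true and the Arabic letters are gone,
which is why you have to check before converting back.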

Alex

-- 
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
