Subject: Re: [boost] Environment Variables Library?
From: Peter Dimov (lists_at_[hidden])
Date: 2015-05-23 09:50:13


Bjørn Roald wrote:
> I think encoding is going to be a challenge.
>
> On POSIX I think you are right that one can assume the character encoding
> is defined by the system, and that may mean multi-byte or single-byte
> character strings, whatever is defined in the locale.

On POSIX, the system doesn't care about encodings. You get from getenv
exactly the byte string you passed to setenv.
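
A minimal sketch of that round-trip, assuming a POSIX system where setenv
is available. The helper name env_round_trips and the variable name used
below are made up for illustration:

```cpp
#include <cassert>
#include <cstdlib>   // std::getenv
#include <stdlib.h>  // setenv (POSIX, not ISO C++)
#include <string>

// Hypothetical helper: stores `value` under `name`, then reports whether
// getenv hands back exactly the same bytes. No encoding is involved at
// any point; the value is treated as an opaque NUL-terminated byte string.
bool env_round_trips(const char* name, const std::string& value)
{
    assert(name != nullptr);
    if (setenv(name, value.c_str(), 1) != 0)  // 1 = overwrite if present
        return false;
    const char* got = std::getenv(name);
    return got != nullptr && value == got;
}
```

Calling env_round_trips("ENVLIB_DEMO", "\xFF\xFE raw bytes") should return
true even though the value is not valid UTF-8. The only bytes a value
cannot carry are embedded NULs, since the interface is NUL-terminated.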

> File paths in Windows are stored in double-byte character strings encoded
> as UCS-2, which is the fixed-width 2-byte predecessor of UTF-16.

No, file paths on Windows are UTF-16.

I'm not quite sure how SetEnvironmentVariableA and SetEnvironmentVariableW
interact, though; I don't see it documented. The typical behavior for an A/W
pair is for the A function to be implemented in terms of the W one, using
the current system code page for converting the strings.

The C runtime getenv/_putenv functions actually maintain two separate copies
of the environment, one narrow, one wide.

https://msdn.microsoft.com/en-us/library/tehxacec.aspx

The problem therefore is that it's not quite possible to provide a portable
interface.

On POSIX, programs have to use the char* functions, because they don't
encode/decode and therefore guarantee a perfect round-trip. Using wchar_t*
may fail if the contents of the environment do not correspond to the
encoding that the library uses.
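
To illustrate where the wchar_t* route can break, here is a rough sketch of
a decoding step such a library might perform. It assumes the library
decodes with the current C locale via mbstowcs; try_widen is a hypothetical
name, and a real library could just as well use a different decoder:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>   // std::mbstowcs
#include <optional>
#include <string>

// Hypothetical helper: decode a narrow string using the current C locale.
// Returns std::nullopt when the bytes are not a valid sequence in that
// locale's encoding, which is the failure case described above.
std::optional<std::wstring> try_widen(const char* narrow)
{
    assert(narrow != nullptr);
    // First pass: ask for the number of wide characters needed. Passing a
    // null destination for this purpose is specified by POSIX.
    const std::size_t n = std::mbstowcs(nullptr, narrow, 0);
    if (n == static_cast<std::size_t>(-1))
        return std::nullopt;  // invalid multibyte sequence
    std::wstring wide(n, L'\0');
    std::mbstowcs(&wide[0], narrow, n);  // second pass: actual conversion
    return wide;
}
```

Whether a given environment value decodes depends on the locale in effect,
so the same program can succeed on one machine and return nullopt on
another. The char* functions have no such failure mode.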

On Windows, programs have to use the wchar_t* versions, for the same reason.
Using char* may give you a mangled result if the environment contains a
file name that cannot be represented in the current encoding.

(If the library uses the C runtime getenv/_putenv functions, those will
likely guarantee a perfect round-trip, but this will not solve the problem
with a preexisting wide environment that is not representable.)

Many people - me included - have adopted a programming model in which char[]
strings are assumed to be UTF-8 on Windows, and the char[] API calls the
wide Windows API internally, then converts between UTF-16 and UTF-8 as
appropriate. Since the OS X POSIX API is UTF-8 based and most Linux systems
are transitioning or have already transitioned to UTF-8 as default, using
UTF-8 and char[] results in reasonably portable programs.
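
A rough sketch of that model for reading a variable. getenv_utf8 is a
hypothetical name, not an existing Boost or Windows function; the Windows
branch calls GetEnvironmentVariableW and converts with MultiByteToWideChar
and WideCharToMultiByte, while the POSIX branch passes the bytes through
untouched and simply assumes they are UTF-8:

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: returns the value of `name` (given as UTF-8) as a
// UTF-8 string, or std::nullopt if the variable is not set.
std::optional<std::string> getenv_utf8(const char* name)
{
    assert(name != nullptr);
#ifdef _WIN32
    // Convert the name from UTF-8 to UTF-16 (-1: include the terminator).
    int wlen = MultiByteToWideChar(CP_UTF8, 0, name, -1, nullptr, 0);
    if (wlen <= 0)
        return std::nullopt;
    std::wstring wname(static_cast<std::size_t>(wlen), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, name, -1, &wname[0], wlen);

    // Ask for the required buffer size, in wchar_t, including terminator.
    DWORD size = GetEnvironmentVariableW(wname.c_str(), nullptr, 0);
    if (size == 0)
        return std::nullopt;  // not found, or another error
    std::wstring wvalue(size, L'\0');
    DWORD written = GetEnvironmentVariableW(wname.c_str(), &wvalue[0], size);
    if (written >= size)
        return std::nullopt;  // value changed between the two calls
    wvalue.resize(written);
    if (wvalue.empty())
        return std::string();

    // Convert the value from UTF-16 to UTF-8.
    const int wn = static_cast<int>(wvalue.size());
    int len = WideCharToMultiByte(CP_UTF8, 0, wvalue.data(), wn,
                                  nullptr, 0, nullptr, nullptr);
    if (len <= 0)
        return std::nullopt;
    std::string value(static_cast<std::size_t>(len), '\0');
    WideCharToMultiByte(CP_UTF8, 0, wvalue.data(), wn,
                        &value[0], len, nullptr, nullptr);
    return value;
#else
    // POSIX: no conversion; the bytes are assumed to be UTF-8 already.
    const char* v = std::getenv(name);
    if (v == nullptr)
        return std::nullopt;
    return std::string(v);
#endif
}
```

The caller sees the same char-based signature on both platforms. The
Windows branch never touches the A functions, so the ANSI code page does
not enter the picture.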

This, however, doesn't appeal to people who prefer to use another encoding,
and makes the char[] API not correspond to the Windows char[] API (the A
functions) as those use the "ANSI code page" which can't be UTF-8.

Boost.Filesystem sidesteps the problem by letting you choose whatever
encoding you wish. I don't particularly like this approach.

