Subject: Re: [boost] Environment Variables Library?
From: Bjørn Roald (bjorn_at_[hidden])
Date: 2015-05-23 08:07:39


On 23. mai 2015 02:18, Michael Ainsworth wrote:
> On 22 May 2015, at 8:21 pm, Klaim - Joël Lamotte <mjklaim_at_[hidden]> wrote:
>
>> By the way, what would be the encoding of the strings returned by or
>> passed to the Environment library?
>
> Given that std::getenv returns a char*, I think the library should
> work with std::string, although we did discuss supporting
> std::wstring using templates. Whether std::string is encoded in ASCII
> or UTF8 would be an OS specific thing I imagine.
>
> Someone with more experience with character encodings might want to
> weigh in here.

[Michael, I took the liberty of rearranging your response a bit, as you
are top posting; see http://www.boost.org/community/policy.html]

Disclaimer: I am no character encoding expert, so take care to verify
claims by me here.

I think encoding is going to be a challenge.

On Posix I think you are right that one can assume the character
encoding is defined by the system, and that it may be a multi-byte or a
single-byte encoding, whatever is defined in the locale. As the Posix
getenv and setenv functions are simply char* based with no statements on
encoding, it is possible to let the system determine the encoding.
UTF-8 will likely be used for Unicode support, as other options make
little sense.
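
To make the Posix case concrete, here is a minimal sketch of what "let the
system determine the encoding" could look like. The names set_env_raw and
get_env_raw are hypothetical and not part of any proposed interface; the
functions only pass bytes through and never inspect or convert them.

```cpp
#include <cassert>
#include <stdlib.h>   // POSIX setenv/getenv; setenv is not available on Windows
#include <string>

// Hypothetical helper: sets `name` to `value`, overwriting any existing value.
// The bytes of `value` are handed to the system unchanged.
bool set_env_raw(const std::string& name, const std::string& value)
{
    // POSIX variable names must be non-empty and must not contain '='.
    assert(!name.empty() && name.find('=') == std::string::npos);
    return ::setenv(name.c_str(), value.c_str(), 1) == 0;
}

// Hypothetical helper: copies the value of `name` into `value` and returns
// true if the variable is set; otherwise returns false and leaves `value`
// untouched. No encoding is assumed or checked.
bool get_env_raw(const std::string& name, std::string& value)
{
    const char* raw = ::getenv(name.c_str());
    if (raw == nullptr) return false;
    value.assign(raw);
    return true;
}
```

Whatever bytes go in come back out, so a UTF-8 value survives the round
trip only because nothing in between tries to interpret it.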

On Windows however there are variants of the windows API for environment
variables:

BOOL WINAPI SetEnvironmentVariable(
   _In_ LPCTSTR lpName,
   _In_opt_ LPCTSTR lpValue
);

Unicode and ANSI names
SetEnvironmentVariableW (Unicode) and
SetEnvironmentVariableA (ANSI)

The regular SetEnvironmentVariable uses LPCTSTR, and according to

https://msdn.microsoft.com/en-us/library/windows/desktop/aa383751%28v=vs.85%29.aspx

LPCTSTR is an LPCWSTR if UNICODE is defined, an LPCSTR otherwise.

#ifdef UNICODE
  typedef LPCWSTR LPCTSTR;
#else
  typedef LPCSTR LPCTSTR;
#endif

File paths in Windows are stored as 16-bit wide character strings.
These were originally encoded as UCS-2, the fixed-width 2-byte
predecessor of UTF-16, and as far as I know they have been treated as
UTF-16 since Windows 2000. Other string data may not be wide character
strings, and ASCII and ANSI strings will certainly exist in C++ code.
Nevertheless it seems the conversions should happen when the API is
setting or getting the variables. I am not sure how the Unicode and ANSI
variants of the API functions interact with the actual storage of the
variables in the environment block, but it makes sense that program code
needs to use them to convert when a conversion is needed. A standard C++
library needs to facilitate these conversions as well. I am not sure how
that is best done, but I can imagine the Boost.Filesystem library has
considered options for a very similar problem.
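
As a rough sketch of converting at the API boundary, a getter could expose
std::string everywhere and widen or narrow only around the Windows call.
This assumes UTF-8 as the std::string encoding on Windows, which is my
assumption and not something decided in this thread. The names
get_env_utf8, widen and narrow are hypothetical. Error handling is minimal.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

#ifdef _WIN32
#include <windows.h>
#include <vector>

// UTF-8 -> UTF-16, using the Win32 conversion function.
static std::wstring widen(const std::string& s)
{
    if (s.empty()) return std::wstring();
    const int len = static_cast<int>(s.size());
    const int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), len, nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring out(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.data(), len, &out[0], n);
    return out;
}

// UTF-16 -> UTF-8, using the Win32 conversion function.
static std::string narrow(const std::wstring& w)
{
    if (w.empty()) return std::string();
    const int len = static_cast<int>(w.size());
    const int n = WideCharToMultiByte(CP_UTF8, 0, w.data(), len,
                                      nullptr, 0, nullptr, nullptr);
    if (n <= 0) return std::string();
    std::string out(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.data(), len, &out[0], n,
                        nullptr, nullptr);
    return out;
}
#endif

// Hypothetical getter: returns true and fills `value` if `name` is set.
bool get_env_utf8(const std::string& name, std::string& value)
{
    assert(!name.empty());
#ifdef _WIN32
    const std::wstring wname = widen(name);
    // First call asks for the required size, including the terminator.
    const DWORD size = GetEnvironmentVariableW(wname.c_str(), nullptr, 0);
    if (size == 0) return false;               // variable not found
    std::vector<wchar_t> buf(size);
    const DWORD written =
        GetEnvironmentVariableW(wname.c_str(), &buf[0], size);
    if (written >= size) return false;         // value changed in between
    value = narrow(std::wstring(&buf[0], written));
    return true;
#else
    // Posix: bytes are passed through unchanged, as discussed above.
    const char* raw = std::getenv(name.c_str());
    if (raw == nullptr) return false;
    value.assign(raw);
    return true;
#endif
}
```

Calling the W variant explicitly sidesteps the UNICODE macro, so the
behaviour does not depend on how the calling program was compiled.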

As the UNICODE macro determines whether your Windows program works with
single-byte (ANSI) or double-byte (Unicode) strings when it accesses the
environment, a conversion may be needed when creating child processes.
The CreateProcess function seems to support that: the lpEnvironment
block may contain either Unicode or ANSI characters, and the
CREATE_UNICODE_ENVIRONMENT flag in dwCreationFlags says which one it is.
See the section on the lpEnvironment argument here
https://msdn.microsoft.com/en-us/library/windows/desktop/ms682425%28v=vs.85%29.aspx
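
For illustration, the block CreateProcess expects is a sequence of
null-terminated NAME=value strings followed by one extra terminator. A
sketch of building a wide-character block could look like the following.
The names env_map and make_environment_block are hypothetical. As far as I
know the documentation also asks for the strings to be sorted
case-insensitively by name, which this sketch does not do, since std::map
orders by plain code unit comparison.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical container for the variables of a child process.
typedef std::map<std::wstring, std::wstring> env_map;

// Builds "NAME=value\0NAME=value\0\0" as a wide-character block.
std::vector<wchar_t> make_environment_block(const env_map& vars)
{
    std::vector<wchar_t> block;
    for (env_map::const_iterator it = vars.begin(); it != vars.end(); ++it) {
        assert(!it->first.empty());
        block.insert(block.end(), it->first.begin(), it->first.end());
        block.push_back(L'=');
        block.insert(block.end(), it->second.begin(), it->second.end());
        block.push_back(L'\0');                // terminates this string
    }
    // Assumption: an empty environment is written as two terminators.
    if (block.empty()) block.push_back(L'\0');
    block.push_back(L'\0');                    // terminates the block
    return block;
}
```

On Windows the address of the first element would be passed as
lpEnvironment to CreateProcessW, with CREATE_UNICODE_ENVIRONMENT set in
dwCreationFlags.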

It is annoying that Microsoft ended up with 2-byte characters (UCS-2,
later UTF-16). Other operating systems waited a bit longer to decide how
to support Unicode, I think, and thus had a better option available with
UTF-8. But the situation is what it is and we have to deal with it.

--
Bjørn

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk