Subject: Re: [boost] Environment Variables Library?
From: Bjørn Roald (bjorn_at_[hidden])
Date: 2015-05-23 13:37:17
On 23. mai 2015 15:50, Peter Dimov wrote:
> Bjørn Roald wrote:
>> I think encoding is going to be a challenge.
>> On Posix I think you are right that one can assume the character
>> encoding is defined by the system and that may be a multi or a single
>> byte character strings, whatever is defined in the locale.
> On POSIX, the system doesn't care about encodings. You get from getenv
> exactly the byte string you passed to setenv.
>> File paths in Windows are stored in double byte character strings
>> encoded as UCS-2 which is fixed width 2 byte predecessor of UTF-16.
> No, file paths on Windows are UTF-16.
OK, in that case that is good. One reference I found states that UTF-16
has been supported since Windows 2000, so I must have based my misled
mind on some pretty dated information. Possibly I also mixed it up with
the fact that the two encodings are so similar in normal use that UCS-2
is often mistakenly referred to as UTF-16. So it is hard to know for
sure which statements to trust without testing. I am glad I put a
disclaimer at the top.
> I'm not quite sure how SetEnvironmentVariableA and
> SetEnvironmentVariableW interact though, I don't see it documented. The
> typical behavior for an A/W pair is for the A function to be implemented
> in terms of the W one, using the current system code page for converting
> the strings.
> The C runtime getenv/_putenv functions actually maintain two separate
> copies of the environment, one narrow, one wide.
> The problem therefore is that it's not quite possible to provide a
> portable interface.
One possible, but certainly not perfect, approach is to convert in the
interface as needed between an external and an internal encoding. The
external encoding is either explicitly requested by the user, or UTF-8
is assumed. The internal encoding would always be UTF-16 on Windows and
UTF-8 on POSIX. How bad would that be?
If the Windows implementation converts to/from UTF-16 as needed and
then uses Set/GetEnvironmentVariableW, the Windows back-end is taken
care of, simply enough.
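To make the idea concrete, here is a minimal sketch of the conversion
such a Windows back-end would need before calling
SetEnvironmentVariableW. The function name utf8_to_utf16 is invented
for illustration; a real implementation on Windows could instead use
MultiByteToWideChar, and this sketch does no validation of overlong
forms or surrogate code points in the input.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Decode UTF-8 and re-encode as UTF-16, emitting surrogate pairs for
// code points above U+FFFF. On Windows the result (as wchar_t data)
// would be passed to SetEnvironmentVariableW.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (c < 0x80)            { cp = c;        len = 1; }  // ASCII
        else if ((c >> 5) == 6)  { cp = c & 0x1F; len = 2; }  // 110xxxxx
        else if ((c >> 4) == 14) { cp = c & 0x0F; len = 3; }  // 1110xxxx
        else if ((c >> 3) == 30) { cp = c & 0x07; len = 4; }  // 11110xxx
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t j = 1; j < len; ++j) {
            unsigned char cc = static_cast<unsigned char>(in[i + j]);
            if ((cc >> 6) != 2)                    // must be 10xxxxxx
                throw std::runtime_error("invalid continuation byte");
            cp = (cp << 6) | (cc & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {                                   // surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The reverse direction (GetEnvironmentVariableW result back to UTF-8)
would mirror this, recombining surrogate pairs before encoding.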
However, with this scheme it is harder to assure a formal guarantee of
correctness on POSIX systems. But it is hard to see how simply assuming
that stored environment variables are UTF-8 is any worse than the
alternatives, unless you know the variable's producer used another
encoding; and if you know, it should be possible to convert anyway.
Non-UTF-8 variables will likely become a less and less common problem
with time. You still have the same abilities to recover as with the
current POSIX char* interface, which makes no statement of expected
encoding.
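One way a library could hedge the UTF-8 assumption on POSIX is to check
whether a value read from the environment is even well-formed UTF-8
before interpreting it as such, and fall back to a user-specified
encoding otherwise. A minimal structural check (it does not reject
overlong forms or surrogate code points) might look like this; the
function name is invented for illustration:

```cpp
#include <string>

// Returns true if s is structurally valid UTF-8: every lead byte has a
// recognized pattern and is followed by the right number of 10xxxxxx
// continuation bytes.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        if (c < 0x80)            len = 1;
        else if ((c >> 5) == 6)  len = 2;
        else if ((c >> 4) == 14) len = 3;
        else if ((c >> 3) == 30) len = 4;
        else return false;            // stray continuation or invalid lead
        if (i + len > s.size()) return false;  // truncated sequence
        for (std::size_t j = 1; j < len; ++j)
            if ((static_cast<unsigned char>(s[i + j]) >> 6) != 2)
                return false;
        i += len;
    }
    return true;
}
```

Legacy single-byte values (e.g. a Latin-1 "ø", byte 0xF8) fail this
check, which is exactly the case where the UTF-8 assumption would
otherwise mangle data silently.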
The external encoding (used in API parameters) can depend on the width
of the character type used in the API; the library could have functions
taking both char- and wchar_t-based strings. The char-based string
parameters assume UTF-8, and the wchar_t-based parameters assume UTF-16
or UTF-32, depending on how many bits wide wchar_t is on the platform.
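A possible surface for that dual interface could look as follows; the
function name get_env_var is invented for illustration and nothing here
is an actual proposal from the library:

```cpp
#include <string>

// char-based overload: parameters and result assumed UTF-8 on every
// platform.
std::string get_env_var(const std::string& name);

// wchar_t-based overload: UTF-16 where sizeof(wchar_t) == 2 (Windows),
// UTF-32 where sizeof(wchar_t) == 4 (typical POSIX systems).
std::wstring get_env_var(const std::wstring& name);

static_assert(sizeof(wchar_t) == 2 || sizeof(wchar_t) == 4,
              "wchar_t expected to be UTF-16 or UTF-32 sized");
```

The overload taken, not a runtime flag, then selects which external
encoding the caller is speaking.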
> On POSIX, programs have to use the char* functions, because they don't
> encode/decode and therefore guarantee a perfect round-trip.
Right, but I question how much value that perfect round-trip has if the
consumer has to guess the encoding. That is basically saying: I kept
the encoding, therefore I am happy, even if I may have lost the
meaning.
> wchar_t* may fail if the contents of the environment do not correspond
> to the encoding that the library uses.
> On Windows, programs have to use the wchar_t* versions, for the same
> reason. Using char* may give you a mangled result in the case the
> environment contains a file name that cannot be represented in the
> current encoding.
> (If the library uses the C runtime getenv/_putenv functions, those will
> likely guarantee a perfect round-trip, but this will not solve the
> problem with a preexisting wide environment that is not representable.)
> Many people - me included - have adopted a programming model in which
> char strings are assumed to be UTF-8 on Windows, and the char API
> calls the wide Windows API internally, then converts between UTF-16 and
> UTF-8 as appropriate. Since the OS X POSIX API is UTF-8 based and most
> Linux systems are transitioning or have already transitioned to UTF-8 as
> default, using UTF-8 and char results in reasonably portable programs.
I have also followed this pattern for portable code in the past, and I
think it is a good pattern to support in a new library.
> This however doesn't appeal to people who prefer to use another
> encoding, and makes the char API not correspond to the Windows char
> API (the A functions) as those use the "ANSI code page" which can't be
I thought at least some ANSI and ISO code pages were ASCII based, are
they not? Given that all values in the range 0 through 127 are the same
as in ASCII, those encodings are just as much UTF-8 as pure ASCII is.
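That point can be checked mechanically: a string containing only bytes
0 through 127 is valid UTF-8 as-is, so an ASCII-only value from an
ASCII-based "ANSI" environment passes through a UTF-8 interface
unchanged. The helper name below is invented for illustration:

```cpp
#include <string>

// True if every byte is in the ASCII range 0..127, i.e. the string has
// identical bytes in ASCII, in ASCII-based code pages, and in UTF-8.
bool is_ascii_only(const std::string& s) {
    for (unsigned char c : s)
        if (c > 0x7F) return false;
    return true;
}
```

Only values containing bytes above 127 (accented letters, etc.) differ
between such code pages and UTF-8, which is where the mismatch Peter
describes actually bites.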
> Boost.Filesystem sidesteps the problem by letting you choose whatever
> encoding you wish. I don't particularly like this approach.
I guess it adds complexity to the API that could discourage users who
only need one or two common UTF encodings. A separate string conversion
library could do the rest of the job when odd encodings are needed. Are
there any other disadvantages?
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk