Subject: Re: [boost] [nowide] Easy Unicode For Windows: Request For Comments/Preliminary Review
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2012-05-31 00:11:59
----- Original Message -----
> From: Beman Dawes <bdawes_at_[hidden]>
> On Mon, May 28, 2012 at 8:33 AM, Artyom Beilis <artyomtnk_at_[hidden]>
> wrote:
>> Hello all Booster,
>>
>> I would like comments on a library that I want to submit for a formal
>> review.
>>
>> The library provides an implementation of standard C and C++ library
>> functions such that their inputs are UTF-8 aware on Windows without
>> requiring using Wide API to make program work on Windows.
>
> Both the above and the docs seem to focus on the problems of UTF-8
> awareness on Windows. That's a problem well worth solving, but...
>
> Am I correct in assuming that the library allows writing portable
> programs that handle UTF-8 strings correctly on other operating
> systems, too, regardless of whether the native narrow string encoding
> is UTF-8 or something different? For example, a POSIX-like operating
> system set up to use some legacy Asian character set encoding?
>
> --Beman
>
Great question.

No. On POSIX platforms it is actually inherently incorrect
to convert strings to/from locale encodings.

You can create or remove a file named "\xFF\xFF.txt" (invalid UTF-8),
or pass that name as a parameter to a program, and it will work even
if the current locale is a UTF-8 locale. Also, if you change the locale
from, say, en_US.UTF-8 to en_US.ISO-8859-1, it will not magically change
all the files in the OS or the strings a user may pass to the program.
(This works on all POSIX OSes, even on Mac OS X.)

POSIX OSes treat strings as NUL-terminated cookies, so altering their
content according to the locale would actually lead to incorrect behavior.
For example, if I create a program "rm":

#include <cstdio>   // std::remove

int main(int argc, char **argv)
{
    // Pass each argument to the OS unchanged, byte for byte.
    for (int i = 1; i < argc; i++)
        std::remove(argv[i]);
    return 0;
}
It would work with ANY locale, and changing the strings would
lead to incorrect behavior.
The locale on POSIX platforms does not have the same effect
as the locale on Windows.
A few additional points:
- On POSIX platforms the locale sometimes does not have an encoding:
  the frequently used C locale does not actually define one!
- Non-UTF-8 locales are considered deprecated today, and it is common
  practice to require that a program run under a UTF-8 locale,
  especially since this can be trivially arranged by setting one
  environment variable.
Bottom line:
1. The situation is not symmetric: on POSIX platforms strings
   are cookies, unlike on Windows.
2. There are good reasons not to alter the encoding.
Artyom Beilis
P.S.: I had already sent this message to the list,
but it seems to have been lost.