Subject: Re: [boost] [review] Review of Nowide (Unicode) starts today
From: Artyom Beilis (artyom.beilis_at_[hidden])
Date: 2017-06-16 07:00:43
On Fri, Jun 16, 2017 at 8:00 AM, Frédéric Bron <frederic.bron_at_[hidden]> wrote:
>> Please note: Under POSIX platforms no conversions are performed
>> and no UTF-8 validation is done as this is incorrect:
>>
>> http://cppcms.com/files/nowide/html/index.html#qna
>
> I do not quite understand the rationale behind not converting to UTF-8
> on POSIX platforms. I naively thought I got UTF-8 in argv because my
> system is configured in UTF-8, but I discovered that this is not
> necessarily always the case. In the example you highlight, I do not see
> the difference from the Windows case. You could convert to UTF-8 in
> argv and back to the local encoding in nowide::remove. I understand it
> is not efficient if you do not really use the content of the filename,
> but if you have to write, say, an XML report in UTF-8, you would have
> to convert anyway.
>
> Today, what is the portable way to convert argv to UTF-8? i.e. without
> having to #ifdef _WIN32...?
>
> Frédéric
Hello Frederic,
There are several reasons for this.
One is the original purpose of the library: to use the same type of
strings internally without creating broken software on Windows. Since
only Windows uses a native wide API instead of the narrow API that is
native for C++, only the Windows case requires encoding conversion.
However, there is another, deeper issue. On Windows the native wide API
has a well-defined UTF-16 encoding; that isn't the case for Unix-like OSes.
There the encoding is defined by the current locale, which can be set
globally, per user, or per process, and can even be changed trivially
within the same process at runtime.
There are also several sources of locale/encoding information:
- Environment variables LANG/LC_CTYPE: UTF-8 on the vast majority of
modern Unix-like platforms, but frequently undefined or set to the "C"
locale, which carries no encoding information.
This is what the OS defines for the process.
- C locale (setlocale API): the "C" locale by default, as the standard
requires, unless explicitly set otherwise.
- C++ locale (std::locale::global() API): the "C" locale by default, as
the standard requires, unless explicitly set otherwise.
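As a rough illustration of the three sources listed above, the sketch
below reads each of them independently. The names LocaleSources and
query_locale_sources are made up for this example, and checking LC_ALL
first is an addition that follows the usual POSIX precedence; it is not
part of the list above.

```cpp
#include <clocale>
#include <cstdlib>
#include <initializer_list>
#include <locale>
#include <string>

// Snapshot of the three independent locale sources seen by this process.
// (Hypothetical helper type, for illustration only.)
struct LocaleSources {
    std::string environment; // first non-empty of LC_ALL, LC_CTYPE, LANG
    std::string c_locale;    // current C locale for the LC_CTYPE category
    std::string cpp_locale;  // name of the current global C++ locale
};

// Returns the locale name requested through the environment, or an empty
// string if none of the variables is set to a non-empty value.
static std::string env_locale()
{
    for (const char* name : {"LC_ALL", "LC_CTYPE", "LANG"}) {
        const char* value = std::getenv(name);
        if (value && *value)
            return value;
    }
    return std::string();
}

LocaleSources query_locale_sources()
{
    LocaleSources s;
    s.environment = env_locale();

    // Passing a null pointer only queries the C locale, it does not change it.
    const char* c_name = std::setlocale(LC_CTYPE, nullptr);
    s.c_locale = c_name ? c_name : "";

    // A default-constructed std::locale is a copy of the global C++ locale.
    s.cpp_locale = std::locale().name();
    return s;
}
```

In a program that never calls setlocale or std::locale::global, the last
two fields are both "C" no matter what the environment says, so none of
the three values can be trusted to describe the encoding of argv.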
All of them can be changed at runtime, they aren't synchronized with each
other, and they can be set to whatever encoding the user wants.
Additionally, setting std::locale::global to anything other than the "C"
locale can lead to some really nasty things, like failing to create valid
CSV files because "," gets added to numbers.
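The CSV problem can be reproduced without depending on which locales are
installed. In the sketch below, comma_grouping is a hypothetical facet
standing in for a real locale that groups digits with ","; the function
names are likewise made up for this example.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet that mimics a locale using ',' as the thousands
// separator with groups of three digits.
struct comma_grouping : std::numpunct<char> {
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Installs the facet into the global C++ locale and returns the previous
// global locale so the caller can restore it.
std::locale install_comma_locale()
{
    return std::locale::global(
        std::locale(std::locale::classic(), new comma_grouping));
}

// Formats one CSV row with whatever global locale is active: a new stream
// copies the global locale when it is constructed.
std::string csv_row_default(int a, int b)
{
    std::ostringstream out;
    out << a << ',' << b;
    return out.str();
}

// Formats one CSV row with the classic "C" locale, regardless of the
// global locale.
std::string csv_row_classic(int a, int b)
{
    std::ostringstream out;
    out.imbue(std::locale::classic());
    out << a << ',' << b;
    return out.str();
}
```

After install_comma_locale(), csv_row_default(1234567, 2) yields
"1,234,567,2", which a CSV reader sees as four fields instead of two,
while csv_row_classic(1234567, 2) still yields "1234567,2".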
So the safest and most correct way to handle this is to pass narrow
strings as is, without any conversion.
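A minimal sketch of what "as is" means in practice, assuming a POSIX file
system that treats file names as opaque bytes (common Linux file systems
do; others may not). The helper names are made up for this example. The
point is that the path is never validated or re-encoded, so a name that
is not valid UTF-8 still round-trips.

```cpp
#include <cstdio>
#include <string>

// Creates an empty file. The name is handed to the OS byte for byte,
// with no validation and no encoding conversion.
bool create_empty_file(const std::string& path)
{
    std::FILE* f = std::fopen(path.c_str(), "wb");
    if (!f)
        return false;
    std::fclose(f);
    return true;
}

// Removes a file, again passing the name through untouched.
bool remove_file(const std::string& path)
{
    return std::remove(path.c_str()) == 0;
}
```

For example, the name "nowide_sketch_\xE9.tmp" contains the single byte
0xE9, which is "é" in ISO-8859-1 but is not valid UTF-8. Passed through
unchanged, it can be created and removed again; a layer that insisted on
valid UTF-8 could not address this file at all.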
Regards,
Artyom Beilis