Subject: Re: [boost] [General] Treat narrow strings as UTF-8 (compilation flag)
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2011-07-22 03:23:12
Hello All,
I can suggest the following policy:
- Boost must deprecate use of the ANSI API on Windows everywhere.
- Boost must use only the Wide API, explicitly.
- Boost must treat all narrow strings as UTF-8, regardless of
 the fact that this is not compatible with _some_ other software
 that uses ANSI encodings, and convert them to wide strings once.
 To make things simpler, the conversion should be done only
 at the last stage - close to OS system calls/C library calls like
 CreateFileW or _wfopen, _wremove.
- I think that, where possible, there should be an optional
 backward-compatibility build/compilation flag like
 BOOST_WINDOWS_USE_ANSI_ENCODING for those who want to stick
 with the old API.
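
To make the "convert only at the last stage" point concrete, here is a minimal sketch of what such a boundary could look like. The helper names (to_wide, open_file, remove_file) are hypothetical and are not existing Boost functions; the only assumptions are the flag name suggested above and the standard Windows calls MultiByteToWideChar, _wfopen and _wremove. On non-Windows systems the narrow string is passed through unchanged.

```cpp
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <wchar.h>
#include <windows.h>

// Hypothetical helper: narrow -> UTF-16, called right before the OS call.
// Narrow strings are taken as UTF-8 unless the backward-compatibility
// flag suggested above is defined.
inline std::wstring to_wide(const std::string& s)
{
#ifdef BOOST_WINDOWS_USE_ANSI_ENCODING
    const UINT cp = CP_ACP;   // old behaviour: OS ANSI code page
#else
    const UINT cp = CP_UTF8;  // proposed default
#endif
    if (s.empty()) return std::wstring();
    const int len = static_cast<int>(s.size());
    // First call asks for the required length, second call converts.
    const int n = MultiByteToWideChar(cp, 0, s.data(), len, NULL, 0);
    std::wstring w(n, L'\0');
    if (n > 0) MultiByteToWideChar(cp, 0, s.data(), len, &w[0], n);
    return w;
}
#endif

// Hypothetical wrapper: opens a file whose name is a narrow string.
inline std::FILE* open_file(const std::string& name, const std::string& mode)
{
#ifdef _WIN32
    return _wfopen(to_wide(name).c_str(), to_wide(mode).c_str());
#else
    return std::fopen(name.c_str(), mode.c_str());
#endif
}

// Hypothetical wrapper: removes a file whose name is a narrow string.
inline int remove_file(const std::string& name)
{
#ifdef _WIN32
    return _wremove(to_wide(name).c_str());
#else
    return std::remove(name.c_str());
#endif
}
```

Because both wrappers go through the same conversion, the name used to create a file and the name used to remove it always map to the same wide string, whatever the C locale or the OS ANSI code page is.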
I also want to explain why continuing to use the ANSI API is
not compatible, and will remain incompatible, even
within existing software.
----------------------------------------------------
----------------------------------------------------
The ANSI/Narrow API is not compatible with itself: there
are several places where the encoding is defined, and
it is used differently in different places, even within
native Windows software like Visual Studio itself.
----------------------------------------------------
----------------------------------------------------
For example, this program does not do what is expected
when it is compiled with Microsoft Visual Studio 2008/2010:
  1 setlocale(LC_ALL,"Russian_Russia.1251"); // Set Russian locale
  2 std::ofstream text("ÐиÑ.txt"); // encoded as 1251
    text << "Hello" << std::endl;
    text.close();
  3 std::remove("ÐиÑ.txt"); // 1251
1. Sets the global C locale to Russian and sets the
  code page to 1251 - the Cyrillic encoding.
2. The text stream is opened. "ÐиÑ.txt" is converted from
  CP1251 to UTF-16 and the file is created.
3. std::remove converts "ÐиÑ.txt" to UTF-16 according to the OS ANSI
  code page - it may not be the same code page as was set in (1).
  So the file remains on the system and is not removed,
  because two different parts of the same program use different
  narrow encodings.
And this happens within the same runtime and the same compiler!
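
The mismatch can be reproduced without any Windows API by decoding the same bytes under two code pages. The sketch below is only an illustration: the decoders are hypothetical, cover only ASCII plus the upper ranges of Windows-1251 and Windows-1252, and return U+FFFD for everything else. The file name in the usage note is an arbitrary three-letter Cyrillic name chosen for the example.

```cpp
#include <string>

// Partial Windows-1251 decoder: 0xC0..0xFF map to U+0410..U+044F.
inline char16_t decode_cp1251(unsigned char b)
{
    if (b < 0x80) return b;
    if (b >= 0xC0) return static_cast<char16_t>(0x0410 + (b - 0xC0));
    return 0xFFFD; // 0x80..0xBF are not covered by this sketch
}

// Partial Windows-1252 decoder: 0xA0..0xFF map to U+00A0..U+00FF.
inline char16_t decode_cp1252(unsigned char b)
{
    if (b < 0x80 || b >= 0xA0) return b;
    return 0xFFFD; // 0x80..0x9F are not covered by this sketch
}

// Widen a narrow string byte by byte with the given single-byte decoder.
inline std::u16string widen(const std::string& s,
                            char16_t (*decode)(unsigned char))
{
    std::u16string out;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        out.push_back(decode(static_cast<unsigned char>(s[i])));
    return out;
}
```

With the CP1251 bytes "\xCC\xE8\xF0.txt", widen(name, decode_cp1251) gives a Cyrillic file name, while widen(name, decode_cp1252) gives three accented Latin letters. A file created under the first name will never be found under the second, which is what happens in step (3) when the OS ANSI code page is not 1251.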
---------------------------------------------------------------
1. The ANSI API must be deprecated.
2. UTF-8 should be used by default.
Many libraries have adopted this policy on Windows,
as ANSI encoding keeps us behind and makes cross-platform
programming a nightmare.
Examples of libraries that adopted UTF-8 on Windows:
1. GTK/GTKmm
2. Sqlite3
3. Boost.Locale - the UTF-8 policy was very welcomed by many
  reviewers
I'd put more libraries on this list, but none come
to mind right now.
I'd suggest making this an official Boost policy
and bringing it to a formal review.
-----------------------------------------------------------
I would personally write patches for the Boost libraries
that still use the ANSI API and fix them if required.
Yakov - I am with you on this, because the current
Windows/Unicode situation in Boost is very bad.
------------------------------------------------------------
Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/
----- Original Message -----
> From: Yakov Galka <ybungalobill_at_[hidden]>
> To: boost_at_[hidden]
> Cc:
> Sent: Friday, July 22, 2011 9:49 AM
> Subject: Re: [boost] [General] Treat narrow strings as UTF-8 (compilation flag)
>
> Hello again,
>
> My previous mail was ignored by the community, and I would like to know why.
> If it wasn't clear, I want to hear your opinion on the topic.
>
> If there is a disagreement, I would like to know what is the reason for the
> disagreement. If there are problems in the proposal, perhaps we can fix them
> and come to a solution accepted by all.
>
> If you agree in principle but just don't have the resources for this work,
> I'm going to do this work (or part of it). I just don't want to waste my
> time on something that is certainly going to be rejected.
>
> Thank you in advance,
> --
> Yakov Galka
>
>
>
> On Tue, Jul 5, 2011 at 19:25, Yakov Galka <ybungalobill_at_[hidden]> wrote:
>
>> Hello All,
>>
>> About half a year ago there was a long discussion titled "Always treat
>> std::strings as UTF-8". The only objection to the proposal was that making
>> an instant switch by assuming UTF-8 by default will give surprising results
>> to those who're unaware of the convention (or prefer using legacy encodings
>> instead of UTF-8). This applies almost only to Windows developers. However,
>> there are already many projects and organizations that switched to UTF-8
>> even for Windows programming. The company I work in is one of them.
>>
>>
>> Nowadays:
>> ==========
>>
>> All the libraries that accept narrow strings assume the system encoding by
>> default.
>> * filesystem::path - Can be configured through static imbue() function.
>> * system_error_category (windows error description), interprocess (object
>> names)... more? - Don't support Unicode at all. They use the narrow API on
>> Windows.
>> * program_options - Assumes UTF-8 for internal data (Good!), but uses
>> system encoding for paths (parse_config_file) and for environment variables
>> (Bad...).
>>
>> Note that, e.g. path::imbue(), is a painful solution for two reasons:
>> * Any global state initialization is problematic in dynamically-linked,
>> multi-threaded systems (like the one I'm maintaining now). In such cases a
>> compile time configuration is more attractive.
>> * I really don't want to have such a function in each boost library (can be
>> solved by having a global boost::imbue though).
>>
>>
>> Proposal:
>> ========
>>
>> Add a compile-time configuration flag that causes boost to treat all narrow
>> strings as UTF-8. The flag will be off by default.
>> For example, in filesystem it's a matter of setting `codepage` to CP_UTF8
>> in just two places.
>>
>>
>> Rationale:
>> ==========
>>
>> * Those who are ready to move to the UTF-8 future, they can do it by simply
>> setting a compilation flag.
>> * Those who don't care about Unicode correctness are not affected by the
>> addition. There won't be any complaints to boost, like: "Hey! I use boost
>> with these libraries and it doesn't work. Your encoding is wrong!".
>>
>>
>>
>> --
>> Yakov Galka
>>
>>