Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-14 04:42:44


Hi,

On Thu, Jan 13, 2011 at 8:21 PM, Artyom <artyomtnk_at_[hidden]> wrote:
> Hello All,
>
> I wanted to talk about it for a loooooong time.
> however never got there.
>
> -------------------------------------------------
>
>
> Proposal Summary:
> ===================
>
> - We need to treat std::string, char const * as
>  UTF-8 strings on Windows and drop support for the
>  so-called ANSI API.
>
> - Optional but recommended:
>
>  Deprecate wide strings as unportable API.

Fully agree. Two years ago I would very probably have been advocating
some kind of TCHAR/wxChar/QChar/whatever-like character type
switching. Since then I've spent a lot of time developing portable
GUI applications and found out the hard way that it is better
to dump all the ANSI CPXXXX / UTF-XY encodings, stick
to UTF-8, and defer the conversion to whatever the native API
uses until you make the actual call.

a) UTF-16 in principle is ok but many implementations are not:
> http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful

b) UTF-32 is basically a waste of memory for most localizations.
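
To put rough numbers on points a) and b), here is a minimal sketch that reports how many bytes one piece of text takes in each encoding. The names encoded_sizes and EncodedSizes are made up for this illustration, and the function assumes the input is already well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this illustration only.
struct EncodedSizes {
    std::size_t utf8_bytes;
    std::size_t utf16_bytes;
    std::size_t utf32_bytes;
};

// Hypothetical helper: assumes `utf8` holds well-formed UTF-8.
EncodedSizes encoded_sizes(const std::string& utf8)
{
    std::size_t code_points = 0;
    std::size_t utf16_units = 0;
    for (unsigned char b : utf8) {
        // Continuation bytes (10xxxxxx) do not start a new code point.
        if ((b & 0xC0) == 0x80) continue;
        ++code_points;
        // A 4-byte UTF-8 sequence (lead byte 11110xxx) encodes a code point
        // above U+FFFF, which needs a surrogate pair in UTF-16.
        utf16_units += ((b & 0xF8) == 0xF0) ? 2 : 1;
    }
    return { utf8.size(), utf16_units * 2, code_points * 4 };
}
```

For plain ASCII text such as "hello" this gives 5, 10 and 20 bytes respectively. The surrogate-pair branch is the case that code treating UTF-16 as a fixed-width encoding tends to get wrong.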

>
[snip]
>
> Suggestion:
> ===========
>
> Char Strings
> ------------
>
> - Under POSIX platform:
>
>  Treat them as byte sequences with current locale,
>  by default assume that they are UTF-8 as:
>
>  a) Default Locale on most OSs is UTF-8 locale
>  b) POSIX API does not care about encodings
>     Even if the locale is not UTF-8 you still
>     can do anything right as
>
> - Under Windows platform:
>
>  a) Treat them as UTF-8 strings, convert them to
>     UTF-16 just before accessing system services.
>  b) Never use the ANSI API; always use the Wide API.
>     It is the default internal encoding anyway.
>
>
> Wide String:
> ------------
>
> - Deprecate them, unless you have something tied
>  to Windows system API.

+1, IMO having two APIs that are not seamlessly interchangeable
in the code (at least with the macro trickery) is useless.
[snip]

>
> What problem this would solve for us?
> =====================================
>
> 1. All standard APIs support Unicode naturally, as they
>   are supposed to.
>
>   - Want to open boost::filesystem::fstream?
>   - Want to pass parameters to other process?
>   - Want to display message?
>   - Want to read XML or JSON?
>
>   All works with Unicode by default because:
>
>   a) It is Unicode by default on Unix
>   b) Because they are mapped to wide API on
>      Windows.
>
> 2. Portable program should no longer worry about
>   setting standard locale facets, etc.
>
>   The program becomes much more portable.
>
> 3. Fewer bugs related to Unicode handling.
>
> Artyom
>
+1, but in my experience it is easier said than done.

My knowledge of Unicode and UTF-8 is little more than
superficial and I haven't done a lot of char-by-char manipulation,
but to do what you are proposing we need at least some
straightforward (and efficient) way to convert the native
strings to the required encoding at the call site.

I'm not trying to nitpick on anyone's implementation of
a Unicode library here, but having to instantiate ~10
transcoding-related classes just to call ShellExecuteW
is not my idea of straightforward. :)
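
To illustrate what "straightforward" could look like, here is a minimal sketch of a single-function UTF-8 to UTF-16 conversion. The name utf8_to_utf16 is hypothetical and belongs to no existing library. The code is written portably so it compiles anywhere; on Windows the same job could be handed to MultiByteToWideChar with CP_UTF8 instead. Throwing on malformed input is just one possible policy.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of Boost or the Windows SDK.
// Decodes UTF-8 and re-encodes it as UTF-16; throws on malformed input.
std::u16string utf8_to_utf16(const std::string& in)
{
    // Smallest code point allowed for each sequence length (rejects overlongs).
    static const char32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Fold in the 6 payload bits of each continuation byte.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

At a call site on Windows, where wchar_t is 16 bits wide, the result could be copied into a std::wstring (for example std::wstring w(s.begin(), s.end())) and its c_str() passed to a wide function such as ShellExecuteW. That keeps UTF-8 everywhere else in the program and confines the conversion to the one line that makes the system call.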

[snip]

BR, Matus

