Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-13 14:21:10


Hello All,

I have wanted to raise this for a long time,
but never got around to it.

-------------------------------------------------

Proposal Summary:
===================

- We need to treat std::string and char const * as
  UTF-8 strings on Windows and drop support for
  the so-called ANSI API.

- Optional but recommended:
  
  Deprecate wide strings as a non-portable API.

Basics:
========

There is a big difference in how the Windows and POSIX
platform APIs handle Unicode. It can be summarized as follows:

OS                 Modern Unix     Modern Windows
---------------------------------------------------------------------
char string:       UTF-8           Obsolete ANSI codepage (like 1251)
wchar_t string:    UTF-32          UTF-16
OS Native API:     char            wchar_t
Common encoding:   UTF-8           UTF-16

Unicode Support    Modern Unix     Modern Windows
-------------------------------------------------
char API           Full Unicode    Not supported
wchar_t API        Does not exist  Full

Bottom line:

You can't open or delete a file in a cross-platform way!

Suggestion:
===========

Char Strings
------------

- Under POSIX platforms:

  Treat them as byte sequences in the current locale;
  by default assume that they are UTF-8, because:

  a) The default locale on most OSs is a UTF-8 locale.
  b) The POSIX API does not care about encodings.
     Even if the locale is not UTF-8, everything
     still works correctly.

- Under the Windows platform:

  a) Treat them as UTF-8 strings, and convert them to
     UTF-16 just before accessing system services.
  b) Never use the ANSI API; always use the Wide API.
     UTF-16 is the default internal encoding anyway.
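
To make step (a) for Windows concrete, here is a minimal sketch of the
UTF-8 to UTF-16 conversion. The name utf8_to_utf16 is hypothetical (it
is not an existing Boost function), and the decoder is hand-written so
the example stays portable. On Windows itself the Win32 function
MultiByteToWideChar with CP_UTF8 could do the same job.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, for illustration only: decode a UTF-8 byte string
// and re-encode it as UTF-16. Throws std::invalid_argument on malformed input.
std::u16string utf8_to_utf16(const std::string& in) {
    // Smallest code point that legitimately needs a sequence of each length
    // (used to reject overlong encodings); indexed by sequence length.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Fold in the continuation bytes (each must look like 10xxxxxx).
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogate code points, and out-of-range values.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

The result would be handed to the wide API right at the system-call
boundary; everywhere else the program keeps working with plain
std::string.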

Wide Strings
------------

- Deprecate them, unless you have something tied
  to Windows system API.

  a) They are not portable: no OS (except Windows)
     uses wide strings in its API.
  b) They are not well defined: they may be UTF-16 or UTF-32.

  For more details read:
  
  http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
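
A small sketch of point (b): the same wide literal has a different
length depending on the platform's wchar_t. The function name
wide_units_for_emoji is hypothetical and only serves the illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only: count how many wchar_t
// elements the compiler emits for a single code point outside the BMP.
std::size_t wide_units_for_emoji() {
    const std::wstring s = L"\U0001F600";  // one code point, U+1F600
    return s.size();
}
```

Where wchar_t is 32 bits wide (typical on Unix) this is one element;
where it is 16 bits wide (Windows) the code point is stored as a
surrogate pair, so it is two. Code that indexes or measures a
std::wstring therefore behaves differently across platforms.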

What problems would this solve for us?
======================================

1. All standard APIs support Unicode naturally, as they
   are supposed to.

   - Want to open boost::filesystem::fstream?
   - Want to pass parameters to other process?
   - Want to display message?
   - Want to read XML or JSON?

   Everything works with Unicode by default because:

   a) It is Unicode by default on Unix.
   b) The calls are mapped to the wide API on
      Windows.

2. Portable programs no longer need to worry about
   setting standard locale facets, etc.

   The program becomes much more portable.

3. Fewer bugs related to Unicode handling.
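
As a rough sketch of what point 1 could look like for opening and
deleting a file, assuming two hypothetical wrappers named open_utf8 and
remove_utf8 (these are not existing Boost functions). The Windows
branch relies on the Win32 function MultiByteToWideChar and on the
Microsoft CRT functions _wfopen and _wremove; the POSIX branch passes
the bytes through unchanged.

```cpp
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Convert UTF-8 to UTF-16 via the Win32 API.
// Returns an empty string if the input is empty or not valid UTF-8.
static std::wstring widen(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    const int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      utf8.data(), static_cast<int>(utf8.size()),
                                      nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring out(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()), &out[0], n);
    return out;
}
#endif

// Hypothetical wrapper: open a file whose name is given as UTF-8.
// Returns nullptr on failure, like std::fopen.
std::FILE* open_utf8(const std::string& path, const std::string& mode) {
#ifdef _WIN32
    const std::wstring wpath = widen(path);
    const std::wstring wmode = widen(mode);
    if (wpath.empty() || wmode.empty()) return nullptr;
    return _wfopen(wpath.c_str(), wmode.c_str());  // wide API, never ANSI
#else
    return std::fopen(path.c_str(), mode.c_str()); // bytes passed through as-is
#endif
}

// Hypothetical wrapper: delete a file whose name is given as UTF-8.
bool remove_utf8(const std::string& path) {
#ifdef _WIN32
    const std::wstring wpath = widen(path);
    if (wpath.empty()) return false;
    return _wremove(wpath.c_str()) == 0;
#else
    return std::remove(path.c_str()) == 0;
#endif
}
```

The caller sees one char-based interface on both platforms; the
conversion to UTF-16 happens only inside the Windows branch.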

Artyom

----- Original Message ----
> Chad Nelson <chad.thecomfychair_at_[hidden]>
>
> Artyom <artyomtnk_at_[hidden]> wrote:
>
> [...]
> > Notes:
> >
> > 1. You can also always assume that strings under windows are UTF-8
> > and always convert them to wide string before system calls.
> >
> > This is I think better approach, but it is different from what
> > most of boost does.
> [...]
>
> An interesting thought... I developed a set of ASCII/UTF-8/16/32
> classes for my company not too long ago, and I became fairly familiar
> with the UTF-8 encoding scheme. There was only one issue that stopped
> me from assuming that all std::string types are UTF-8-encoded: what if
> the string *isn't* meant as UTF-8 encoded, and contains characters with
> the high-bit set?
>
> There's nothing technically stopping that from happening, and there's
> no way to determine with complete certainty whether even a string that
> seems to be valid UTF-8 was intended that way, or whether the UTF-8-like
> characters are really meant as their high-ASCII values.
>
> Maybe you know something I don't, that would allow me to change it? I
> hope so, it would simplify some of the code greatly.

      


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk