Subject: Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2012-01-28 19:49:16


On 01/28/2012 08:48 PM, Yakov Galka wrote:
> The user can just write
>
> cout<< u8"您好世界";
>
> Even better is:
>
> cout<< "您好世界";
>
> which *just works* on most compilers (e.g. GCC: http://ideone.com/lBpMJ)
> and needs some trickery on others (MSVC: save as UTF-8 without BOM).

No, that's just wrong.
That's not the model that C++ uses. By not storing it with the BOM,
you're essentially tricking MSVC into believing it is ANSI (windows-1252
on western systems), and thus avoiding the conversion from the source
character set to the execution character set, since those happen to be
the same.

The way a C++ compiler is supposed to work is that all of your source is
in the source character set, regardless of the type of string literal
you use.
Then the compiler will convert your source character set to the
execution character set for narrow string literals, to the wide
execution character set for wide string literals, to UTF-8 for u8
literals, etc.
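
As a rough illustration of that model, here is a minimal sketch that assumes
a compiler with C++11 Unicode literals. The helper name code_units is made up
for this example, and U+00E9 is an arbitrary test character. It is written as
a universal-character-name so the result does not depend on how the compiler
decodes the source file:

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating NUL.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// The same character, U+00E9, ends up in a different encoding
// depending only on the literal prefix.
static_assert(code_units(u8"\u00e9") == 2, "UTF-8: 0xC3 0xA9");
static_assert(code_units(u"\u00e9") == 1, "UTF-16: one code unit");
static_assert(code_units(U"\u00e9") == 1, "UTF-32: one code unit");

// Narrow and wide literals go through the (wide) execution character set,
// so their length is implementation-dependent and is not asserted here.
constexpr std::size_t narrow_units = code_units("\u00e9");
constexpr std::size_t wide_units = code_units(L"\u00e9");
```

The three static_asserts hold wherever the Unicode literal types are
available. The last two values are exactly the part that varies between
compilers and settings.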

The correct way to portably use Unicode characters in a C++ source is to
write it as UTF-8 and ensure that all compilers will consider the source
character set to be UTF-8. Then use the appropriate literal types
depending on what encoding you want your string literals to end up in.
Of course, in the real world, this causes two practical problems:
  - MSVC requires a BOM to be present, but GCC will choke if there is one
  - In the absence of u8 string literals, you're stuck with wide string
literals if you want something resembling Unicode, unless you use narrow
string literals containing only ASCII and \xYY escape sequences (\u and
\U will not work, since the compiler converts them to the execution
character set)
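
To make the \xYY workaround concrete, here is a sketch of the string from the
quoted message spelled out as raw UTF-8 bytes in a narrow literal. The
variable name hello_utf8 is made up for this example. The source stays pure
ASCII, so the compiler has nothing to convert:

```cpp
#include <cstring>

// UTF-8 bytes written by hand; each adjacent literal is one character.
// Splitting the literal also keeps a \x escape from swallowing hex digits
// that might follow it.
const char hello_utf8[] =
    "\xE6\x82\xA8"   // U+60A8
    "\xE5\xA5\xBD"   // U+597D
    "\xE4\xB8\x96"   // U+4E16
    "\xE7\x95\x8C";  // U+754C

// Four characters, three bytes each.
inline std::size_t hello_utf8_length() { return std::strlen(hello_utf8); }
```

The price is readability: the text is no longer visible in the source, which
is why this is only a workaround.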

What probably should happen is that compilers are compelled to support
UTF-8 as the source character set in a unified way.

I once asked volodya if it were feasible to implement this in the build
system (add a BOM for MSVC), but he didn't seem to think it was worth it.
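
For what it's worth, the transformation itself is small. The following is
only a sketch of what such a build step might do to a file's contents, not
anything Boost.Build actually provides; all function names are made up:

```cpp
#include <string>

// The UTF-8 encoding of U+FEFF, used as a byte order mark.
const char utf8_bom[] = "\xEF\xBB\xBF";

// True if the buffer starts with the UTF-8 BOM.
inline bool has_utf8_bom(const std::string& s) {
    return s.compare(0, 3, utf8_bom) == 0;
}

// Contents as they would be handed to MSVC: BOM added if missing.
inline std::string with_utf8_bom(const std::string& s) {
    return has_utf8_bom(s) ? s : utf8_bom + s;
}

// Contents as they would be handed to GCC: BOM stripped if present.
inline std::string without_utf8_bom(const std::string& s) {
    return has_utf8_bom(s) ? s.substr(3) : s;
}
```

Both functions are idempotent, so running the step twice on the same file
does no harm.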

