Subject: Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
From: Beman Dawes (bdawes_at_[hidden])
Date: 2012-01-29 18:25:57


On Sat, Jan 28, 2012 at 7:49 PM, Mathias Gaunard
<mathias.gaunard_at_[hidden]> wrote:
>...
>
> The way a C++ compiler is supposed to work is that all of your source is in
> the source character set, regardless of the type of string literal you use.
> Then the compiler will convert your source character set to the execution
> character set for narrow string literals, to the wide execution character
> set for wide string literals, to UTF-8 for u8 literals, etc.
>
> The correct way to portably use Unicode characters in a C++ source is to
> write it as UTF-8 and ensure that all compilers will consider the source
> character set to be UTF-8. Then use the appropriate literal types depending
> on what encoding you want your string literals to end up in.
> Of course, in the real world, it causes two practical problems:
>  - MSVC requires a BOM to be present, but GCC will choke if there is one
>  - In the absence of u8 string literals, you're stuck with wide string literals
> if you want something resembling Unicode, unless you use narrow string
> literals with just ASCII and escape sequences (\xYY; \u and \U will not work
> since they will be converted)
>
> What probably should be done is that compilers should be compelled to
> support UTF-8 as the source character set in a unified way.

Makes sense to me.
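
To make the literal-conversion point concrete, here is a minimal sketch. It
assumes a compiler with u8 literals (C++11 or later), and hex_bytes is a
hypothetical helper written for this example, not part of any library. The
source is pure ASCII, so the BOM question never comes up; the \x escapes in
the narrow literal are taken as byte values directly, while the u8 literal
is encoded as UTF-8 by the compiler.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only): hex-dump the code units of a
// narrow or u8 string literal, skipping the terminating NUL.
// Works for both char and (in C++20) char8_t literals.
template <typename CharT, std::size_t N>
std::string hex_bytes(const CharT (&lit)[N]) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        const unsigned char b = static_cast<unsigned char>(lit[i]);
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

With this, hex_bytes(u8"\u00E9") and hex_bytes("\xC3\xA9") should both yield
"C3 A9", the UTF-8 encoding of U+00E9, whatever the execution character set.
A plain "\u00E9" in a narrow literal carries no such guarantee.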

Why don't you write up an issue for the C and C++ committees? My
guess is that it would be well received as long as (1) C and C++ stay
in sync (or at least don't conflict), and (2) compiler vendors aren't
required to do anything that would cause existing source files that
work with their compiler to stop working. This issue might well attract
national body support, which increases the chance the committee will
take action.

It would be helpful if the issue write-up included a survey of current
compilers, so that committee members not familiar with the various
compilers could see that UTF-8 is already widely supported, modulo the
BOM issue.

Another possibility is to start lobbying compiler vendors, or at least
Microsoft, to support UTF-8 both with and without BOM.
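
For illustration only, accepting UTF-8 both with and without a BOM amounts
to something like the sketch below. The function names are hypothetical and
this is not how any particular compiler reads its input; the only fact relied
on is that the UTF-8 encoded BOM is the byte sequence EF BB BF.

```cpp
#include <string>

// Hypothetical helper: true if the buffer starts with the UTF-8 BOM (EF BB BF).
bool has_utf8_bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Hypothetical helper: return the buffer without a leading BOM, so that
// files with and without one are treated identically afterwards.
std::string strip_utf8_bom(const std::string& bytes) {
    return has_utf8_bom(bytes) ? bytes.substr(3) : bytes;
}
```

A build step could run source text through strip_utf8_bom before handing it
to a tool that rejects the BOM.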

--Beman

