Subject: Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2012-01-29 09:28:57


On 01/29/2012 02:53 PM, Artyom Beilis wrote:

> Note, MSVC does not allow creating both "שלום" and L"שלום" literals
> as Unicode (UTF-8, UTF-16); for all other compilers it is the default
> behavior.

And it shouldn't.
String literals are in the execution character set. On Windows the
execution character set is what it calls ANSI. That much is not going to
change.

>>> 1. BOM should not be used in source code, no compiler except MSVC uses it
>>> and most do not support it.
>>
>> According to Yakov, GCC supports it now.
>> It would be nice if it could work without any BOM though.
>>
>
> GCC's default input and literal encoding is UTF-8. BOM is not needed.

That's not what I'm saying. What we want is a unified way to set UTF-8
as the source character set.
The problem is that MSVC requires a BOM, but GCC used to not allow one.

>>> 2. Setting UTF-8 BOM makes narrow literals to be encoded in ANSI encoding
>>> which makes BOM useless (crap... sorry) with MSVC even more.
>>
>> That's the correct behaviour.
>
> No, it is unspecified behavior according to the standard.

It isn't.

> Standard does not specify what narrow encoding should be used, that
> is why u8"" was created.

The standard specifies that it is the execution character set. MSVC
specifies that for its implementation, the execution character set is ANSI.

> All (but MSVC) compilers create UTF-8 literals and use UTF-8 input
> and this is the default.

That's because for those other compilers, you are in a case where the
source character set is the same as the execution character set.

With MSVC, if you don't do anything, both your source and execution
character sets are ANSI. If you set your source character set to UTF-8,
your execution character set remains ANSI still.

On non-Windows platforms, UTF-8 is the most common execution character
set, so you can have a setup where source = execution = UTF-8, but you
can't do that on Windows.
But that is irrelevant to the standard.
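
A minimal sketch of the point above, assuming a C++11 compiler. The helper
names hex_bytes and narrow_shalom_bytes are hypothetical, made up for this
illustration. The literal is written with universal character names so the
source character set plays no role; only the execution character set decides
which bytes end up in the program.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats the bytes of a narrow string literal
// (excluding the terminating NUL) as lowercase hex, separated by spaces.
template <std::size_t N>
std::string hex_bytes(const char (&lit)[N]) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i + 1 < N; ++i) {
        const unsigned value = static_cast<unsigned char>(lit[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// A narrow literal for U+05E9 U+05DC U+05D5 U+05DD. Its bytes are chosen
// by the execution character set of whichever compiler builds this file.
inline std::string narrow_shalom_bytes() {
    return hex_bytes("\u05E9\u05DC\u05D5\u05DD");
}
```

With a UTF-8 execution character set (the GCC default), the result should be
the eight bytes d7 a9 d7 9c d7 95 d7 9d. With MSVC and a Hebrew ANSI code
page such as Windows-1255, one would instead expect four bytes, one per
letter. The source text is identical in both cases.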

>> Use u8 string literals if you want UTF-8.
>
> Why on earth should I do this?

Because it makes perfect sense and it's the way it's supposed to work.

> All the world around uses UTF-8. Why should I specify u8"" if it is
> something that can be easily defined at compiler level?

Because otherwise you're not independent from the execution character set.
Writing your program with Unicode allows you to not depend on
platform-specific encodings; that doesn't mean it makes them go away.

I repeat, narrow string literals are and will remain in the execution
character set. Expecting those to end up as UTF-8 data is wrong and not
portable.

> All we need is some flag for MSVC that tells that string
> literals encoding is UTF-8.

That "flag" is using the u8 prefix on those string literals.
Remember: the encoding used for the data in a string literal is
independent from the encoding used to write the source.
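
As a sketch of what the u8 prefix buys you, assuming a compiler that
implements it: the helper names literal_bytes and shalom_utf8 below are
hypothetical. The template accepts both char (u8 literals in C++11 to C++17)
and char8_t (u8 literals in C++20), so it should build in either mode.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies the code units of a string literal
// (excluding the terminating NUL) into a std::string, byte for byte.
// CharT is char for C++11..17 u8 literals and char8_t from C++20 on.
template <typename CharT, std::size_t N>
std::string literal_bytes(const CharT (&lit)[N]) {
    std::string out;
    for (std::size_t i = 0; i + 1 < N; ++i)
        out.push_back(static_cast<char>(lit[i]));
    return out;
}

// The u8 prefix requests UTF-8 data regardless of the execution
// character set, so these bytes do not depend on the platform.
inline std::string shalom_utf8() {
    return literal_bytes(u8"\u05E9\u05DC\u05D5\u05DD");
}
```

Here shalom_utf8() holds the UTF-8 encoding of the four letters, two bytes
each, whatever the platform's narrow encoding happens to be.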

> AFAIR, neither gcc4.6 nor msvc10 supports u8"".

Unicode string literals have been in GCC since 4.5.

However there are indeed practical problems with using the standard
mechanisms because they're not always implemented.
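
One possible workaround where u8"" is not implemented, offered only as a
sketch: spell out the UTF-8 code units with hex escapes, which set the byte
values directly and so bypass the execution character set. The names
shalom_utf8_bytes and utf8_code_points are hypothetical.

```cpp
#include <cstddef>

// UTF-8 code units for U+05E9 U+05DC U+05D5 U+05DD, written by hand.
// Hex escapes give the byte values directly, so no conversion applies.
static const char shalom_utf8_bytes[] = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";

// Counts code points in a NUL-terminated UTF-8 string by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes well-formed input.
inline std::size_t utf8_code_points(const char* s) {
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

This is tedious and unreadable for anything but short strings, which is the
practical cost of not having the standard mechanism available.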

