Subject: Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2012-01-29 10:11:24


On Sun, Jan 29, 2012 at 16:28, Mathias Gaunard <mathias.gaunard_at_[hidden]> wrote:

> On 01/29/2012 02:53 PM, Artyom Beilis wrote:
>
>> No, MSVC does not allow creating both "שלום" and L"שלום" literals
>> as Unicode (UTF-8, UTF-16); for all other compilers that is the default
>> behavior.
>>
>
> And it shouldn't.
> String literals are in the execution character set. On Windows the
> execution character set is what it calls ANSI. That much is not going to
> change.

The execution character set is defined by the implementation, that is, the
compiler and the runtime library. It has nothing to do with the underlying
system. In other words, the implementation is free to decide that the
execution character set is UTF-8, even though Windows narrow strings are in
some 'ANSI' code page. The standard library interfaces (fopen, fstream,
etc.) would then accept UTF-8.
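
To see what a particular implementation actually chose, one can probe the
execution character set at compile time. This is only a sketch: same_bytes
and narrow_literals_are_utf8 are hypothetical names, and it assumes the
compiler accepts the universal-character-name \u05E9 (the letter ש) in a
narrow literal. GCC, for example, exposes the choice through its
-fexec-charset option.

```cpp
#include <cstddef>

// Hypothetical helper: compares two narrow literals byte by byte,
// including the terminating NUL. Written recursively so that it is a
// valid C++11 constexpr function.
template <std::size_t N, std::size_t M>
constexpr bool same_bytes(const char (&a)[N], const char (&b)[M],
                          std::size_t i = 0) {
    return N == M && (i == N || (a[i] == b[i] && same_bytes(a, b, i + 1)));
}

// "\u05E9" is encoded in whatever execution character set the
// implementation chose; "\xD7\xA9" spells out the UTF-8 bytes of U+05E9.
constexpr bool narrow_literals_are_utf8 = same_bytes("\u05E9", "\xD7\xA9");
```

A build that relies on UTF-8 narrow literals could then guard itself with
static_assert(narrow_literals_are_utf8, "...").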

> [...]
>
>>>> 2. Setting a UTF-8 BOM makes narrow literals be encoded in the ANSI
>>>> encoding, which makes the BOM even more useless (crap... sorry) with MSVC.
>>>>
>>>
>>> That's the correct behaviour.
>>>
>>
>> No, it is unspecified behavior according to the standard.
>>
>
> It isn't.

As said above, you can't deduce from the standard what the "execution
character set for Windows" is. MSVC defines it to be 'ANSI', which is the
source of all the problems, but the standard itself leaves that choice to
the implementation.
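
To illustrate why the 'ANSI' choice hurts: the same word ends up as
different bytes, and code that expects UTF-8 rejects the ANSI form. A
minimal sketch, assuming code page 1255 (Hebrew) as the ANSI code page;
is_valid_utf8 and demo_validation are hypothetical names, not a library API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8 (rejects overlong forms,
// surrogates, and code points above U+10FFFF).
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (lead < 0x80)                       { len = 1; cp = lead; }
        else if (lead >= 0xC2 && lead <= 0xDF) { len = 2; cp = lead & 0x1F; }
        else if (lead >= 0xE0 && lead <= 0xEF) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xF0 && lead <= 0xF4) { len = 4; cp = lead & 0x07; }
        else return false;               // stray continuation or invalid lead byte
        if (n - i < len) return false;   // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (len == 3 && cp < 0x800) return false;        // overlong 3-byte form
        if (len == 4 && cp < 0x10000) return false;      // overlong 4-byte form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}

// The word "שלום" written out as explicit bytes in two encodings.
void demo_validation() {
    assert(is_valid_utf8("\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D"));  // UTF-8
    assert(!is_valid_utf8("\xF9\xEC\xE5\xED"));                 // code page 1255
}
```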

>> The standard does not specify what narrow encoding should be used; that
>> is why u8"" was created.
>>
>
> The standard specifies that it is the execution character set. MSVC
> specifies that for its implementation, the execution character set is ANSI.

Yes, and we would like to at least have a flag that overrides the execution
character set to UTF-8.

> [...]
>
>>> Use u8 string literals if you want UTF-8.
>>>
>>
>> Why on earth should I do this?
>>
>
> Because it makes perfect sense and it's the way it's supposed to work.

As of C++11, it doesn't make sense to use any narrow string literal other
than u8"". Why would you use plain "" on Windows?
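
For reference, a u8"" literal always yields UTF-8 bytes, whatever the
execution character set is. The sketch below dumps those bytes; hex_bytes
and demo_u8 are hypothetical names, and the text is written with
universal-character-names so the result does not depend on how the source
file is saved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: formats the bytes of a string literal as hex,
// skipping the terminating NUL. Templated on the character type so it
// accepts u8"" both as const char[] (C++11 to C++17) and as
// const char8_t[] (C++20).
template <typename CharT, std::size_t N>
std::string hex_bytes(const CharT (&lit)[N]) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        const unsigned char b = static_cast<unsigned char>(lit[i]);
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

void demo_u8() {
    // U+05E9 U+05DC U+05D5 U+05DD, i.e. the word "שלום".
    assert(hex_bytes(u8"\u05E9\u05DC\u05D5\u05DD") == "D7 A9 D7 9C D7 95 D7 9D");
}
```

Calling hex_bytes on the same text in a plain "" literal shows what the
execution character set of the compiler at hand produced instead.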

> [...]
>
>> All we need is some flag for MSVC that says the encoding of string
>> literals is UTF-8.
>>
>
> That "flag" is using the u8 prefix on those string literals.
> Remember: the encoding used for the data in a string literal is
> independent from the encoding used to write the source.

Yes, and it will remain independent even with "" meaning u8"". Even if the
source character set were UTF-32, the literal would still be UTF-8.

Sincerely,

-- 
Yakov
