Subject: Re: [boost] Silly Boost.Locale default narrow string encoding in Windows
From: Alf P. Steinbach (alf.p.steinbach+usenet_at_[hidden])
Date: 2011-10-28 11:47:20


On 28.10.2011 14:41, Yakov Galka wrote:
> On Fri, Oct 28, 2011 at 13:58, Peter Dimov <pdimov_at_[hidden]> wrote:
>
>> Alf P. Steinbach wrote:
>>
>>> How do I make the following program work with Visual C++ in Windows, using
>>> narrow character string?
>>>
>>> <code>
>>> #include <stdio.h>
>>> #include <fcntl.h>      // _O_U8TEXT
>>> #include <io.h>         // _setmode, _fileno
>>> #include <windows.h>
>>>
>>> int main()
>>> {
>>>     //SetConsoleOutputCP( 65001 );
>>>     //_setmode( _fileno( stdout ), _O_U8TEXT );
>>>     printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>>> }
>>> </code>
>>>
>>
>> Output to a console wasn't our topic so far (and is not one of my strong
>> points), but the specific problem with this program is that the embedded
>> literal is not UTF-8, as the warning C4566 tells us, so there is no way for
>> you to get UTF-8 in the output. (You should be able to set VC++'s code page
>> to 65001, but I don't think you can.)
>>
>> int main()
>> {
>> printf( utf8_encode( L"кошка" ).c_str() );
>> }
>>
>
> You don't need to configure anything, in fact you cannot do it properly in
> VS. What you can do is:
>
> 1) don't use wide-char literals with non ascii characters
> 2) use UTF-8 literals for narrow-char.
>
> All you need is to save the source as UTF-8 WITHOUT BOM. Works as charm on
> VS2005 and VS2010. Apparently it's portable. The IDE can detect UTF-8 even
> without BOM ("☑ Auto-detect UTF-8 encoding without signature").

This is interesting in a perverse sort of way.

In order to make Visual C++ produce UTF-8 encoded compiled narrow
strings, one must /lie/ to the compiler. The source code is UTF-8. And
one lies and tells the Visual C++ compiler that it's ANSI.

And in order to make g++ produce ANSI encoded compiled narrow strings,
one must /lie/ to the compiler. The source code is ANSI. And one lies and
tells the g++ compiler that it's UTF-8.

As I see it, there's something wrong here.

Notwithstanding the limitation that codepage 65001 is impractical in the
Windows command interpreter -- e.g. the 'more' command CRASHES.

>> This is not a practical problem for "proper" applications because Russian
>> text literals should always come from the equivalent of gettext and never be
>> embedded in code.
>
> +1

I find that a very narrow minded view.

Would you like to be the one telling Norwegian student Åshild Bjørnson
that you favor the notion that she should waste hours or days installing
Boost and some other *nix-oriented library and use 'gettext', in order to
be able to display her name in her first C++ program?

That text representation and output in C++ has been designed (with your
not just willing but enthusiastic vote) to be so inherently complex that
it requires hours or days of effort just to display your name?

> Personally I'm happy with
>
> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>
> writing UTF-8. Even if I cannot configure the console, I still can redirect
> it to a file, and it will correctly save this as UTF-8. Preventing data-loss
> is more important for me.

I find it thoroughly disgusting to have to lie to your tools, and to
rely on an assumption that the tools will not wisen up in the future.

However, I concede the point that IF one is happy with output that's
encoded so that most Windows command line tools fail (e.g. `more`
crashes), and IF one is happy with lying to the compiler about the
source encoding, and IF one is happy assuming that the compiler won't
wisen up about encodings in a future version, then -- the UTF-8 scheme
allows literals with national language characters, not just A through Z.

However, those are pretty constricting conditions.

Cheers & hth.,

- Alf

