Subject: Re: [boost] Silly Boost.Locale default narrow string encoding in Windows
From: Alf P. Steinbach (alf.p.steinbach+usenet_at_[hidden])
Date: 2011-10-29 12:21:23


On 29.10.2011 14:14, Yakov Galka wrote:
> On Fri, Oct 28, 2011 at 17:47, Alf P. Steinbach<
> alf.p.steinbach+usenet_at_[hidden]> wrote:
>
>> On 28.10.2011 15:00, Peter Dimov wrote:
>>
>>> Yakov Galka wrote:
>>>
>>>> Personally I'm happy with
>>>>
>>>> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>>>>
>>>> writing UTF-8. Even if I cannot configure the console, I still can
>>>> redirect
>>>> it to a file, and it will correctly save this as UTF-8.
>>>>
>>>
>>> You can configure the console. Select Consolas or Lucida Console as the
>>> font, then issue chcp 65001. chcp 65001 apparently breaks .bat files
>>> though. :-)
>>>
>>
>> it break a hell of a lot more than batch files. try `more`.
>>
>>
> So I tried to make YOUR approach work (i.e. use wchar_t):

I am afraid that you are misrepresenting me a bit here.

But I am sure it is not intentional.

Let's walk through this.

> Created a file with:
>
> #include<cstdio>
> int main() {
> ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
> }
>
> saved as UTF-8 with BOM. Compiled with VS2005, windows XP.

Except that <cstdio> is not guaranteed to place wprintf in the global
namespace (I commented on that before, better use <stdio.h>), that code
works OK in the sense of doing what you have specified should happen.

Which apparently is not what you think, heh.

You have specified a conversion to narrow characters using the C++
executable narrow character set, i.e. a conversion to Windows ANSI. It
surprises a lot of programmers that that's what wide output such as
'wprintf' or 'wcout' does: a NARROWING CONVERSION. It did surprise me at
one time in the 1990s. I was very disappointed. After that I have become
more and more sure that there was no design of the C++ iostreams, but
that's another story...
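
To make the narrowing concrete, here is a minimal sketch that asks the C
library whether a given wide character has any narrow representation in
the currently selected locale. The helper name
narrows_in_current_locale is hypothetical, and the sketch only
illustrates the general mechanism via the standard wcrtomb function; it
does not claim to show what the Microsoft runtime does internally.

```cpp
#include <cassert>
#include <climits>
#include <clocale>
#include <cstddef>
#include <cwchar>

// Hypothetical helper, for illustration only.
// Returns true if `wc` can be converted to a narrow (multibyte) sequence
// in the currently selected C locale, false if the conversion fails.
bool narrows_in_current_locale(wchar_t wc)
{
    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    char buffer[MB_LEN_MAX];                  // large enough for any one character
    return std::wcrtomb(buffer, wc, &state) != static_cast<std::size_t>(-1);
}
```

In the plain "C" locale a character such as L'B' converts, while a
character such as U+65E5 does not, which is the same kind of loss that
makes the Japanese and Cyrillic parts of the string disappear.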

> M:\bin> a.exe
> Blσbµrsyltet°y!

Yes -- that's what Windows ANSI Western, which you asked for, looks like
when it is presented with the original IBM PC character set, codepage
437. Switch the codepage to 1252, the codepage number for Windows ANSI,
to get the Windows ANSI result that you asked for to display properly.
Of course it will lack the Unicode-only characters:

<example>
P:\test> type jam.cpp
#include <cstdio>
int main() {
     ::wprintf( L"Blåbærsyltet├╕y! 日本国 кошка!\n" );
}

P:\test> chcp 65001
Active code page: 65001

P:\test> type jam.cpp
#include <cstdio>
int main() {
     ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

P:\test> cl jam.cpp
jam.cpp

P:\test> jam
Bl�b�rsyltet�y!
P:\test> chcp 437
Active code page: 437

P:\test> jam
Blσbµrsyltet°y!
P:\test> chcp 1252
Active code page: 1252

P:\test> jam
Blåbærsyltetøy!
P:\test> _
</example>
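
The difference between the 437 and 1252 results above is only a matter
of how the same three bytes are interpreted. A minimal sketch, assuming
a hypothetical helper decode_as that knows just the bytes 0xE5, 0xE6 and
0xF8 (a real codepage table has 256 entries) and returns the displayed
text as UTF-8:

```cpp
#include <cassert>
#include <string>

enum class CodePage { cp437, cp1252 };

// Hypothetical helper, for illustration only.
// Decodes ASCII plus the three bytes 0xE5, 0xE6 and 0xF8 as they would be
// displayed under the given codepage, and returns the result as UTF-8.
// Any other byte is copied through unchanged.
std::string decode_as(const std::string& bytes, CodePage page)
{
    const bool oem = (page == CodePage::cp437);
    std::string utf8;
    for (unsigned char b : bytes) {
        switch (b) {
        case 0xE5: utf8 += oem ? "\xCF\x83" : "\xC3\xA5"; break;  // σ in 437, å in 1252
        case 0xE6: utf8 += oem ? "\xC2\xB5" : "\xC3\xA6"; break;  // µ in 437, æ in 1252
        case 0xF8: utf8 += oem ? "\xC2\xB0" : "\xC3\xB8"; break;  // ° in 437, ø in 1252
        default:   utf8 += static_cast<char>(b);          break;
        }
    }
    return utf8;
}
```

Feeding it the Windows ANSI bytes of the first word gives the σ/µ/°
rendering for cp437 and the å/æ/ø rendering for cp1252; the bytes
themselves never change.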

[snip]
>
> M:\bin> chcp 1252
> M:\bin> a.exe
> Blåbærsyltetøy!
>
> Somewhat better. But how do I get to see the whole string?

Not with any single-byte-per-character encoding. ;-)

You can use UTF-8 or UTF-16 for the output.

UTF-8 is a bit problematic because the Windows support is really flaky.

[snip effort with wide text]
> Ah! it's IMPOSSIBLE with wprintf!

No no, you're jumping to conclusions.

The Microsoft runtime has special support for this at the C library
level, but unfortunately, as far as I know, not at the C++ level.

Still, since you're using 'wprintf', that's at the C level, so it's no
problem:

<example>
P:\test> chcp 65001
Active code page: 65001

P:\test> type jam.cpp
#include <stdio.h>
#include <io.h> // _setmode
#include <fcntl.h> // _O_U8TEXT

int main()
{
     _setmode( _fileno( stdout ), _O_U8TEXT );
     ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

P:\test> cl jam.cpp
jam.cpp

P:\test> jam
Blåbærsyltetøy! 日本国 кошка!

P:\test> g++ jam.cpp
jam.cpp: In function 'int main()':
jam.cpp:7: error: '_O_U8TEXT' was not declared in this scope

P:\test> g++ jam.cpp -D __MSVCRT_VERSION__=0x0800

P:\test> a
Blåbærsyltetøy! 日本国 кошка!

P:\test> _
</example>
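
Conceptually, the UTF-8 text mode used above turns each character of the
wide string into one to four bytes on output. The following is a sketch
of that encoding step only, written against Unicode code points; the
names append_utf8 and to_utf8 are hypothetical, and this is not the
runtime's actual implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, for illustration only.
// Appends the UTF-8 encoding of one code point to `out`. Returns false and
// appends nothing for surrogates and for values above U+10FFFF.
bool append_utf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // lone surrogate
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}

// Encodes a whole string, substituting U+FFFD for invalid code points.
std::string to_utf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        if (!append_utf8(out, cp)) out += "\xEF\xBF\xBD";
    }
    return out;
}
```

With this, the Norwegian, Japanese and Russian characters all survive,
each as a two- or three-byte sequence, which is why the whole string
shows up in the output above.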

> Let's try UTF-8 instead.
[snip effort]

> It works! MAGIC! More importantly: ***It's the only way to make it work!***

See above.

Those statements are wrong in two important respects.

First wrongness, the Windows console window support for UTF-8 is really
really flaky, so that you get more or less arbitrary "errors". They seem
to be connected with timing or something. So UTF-8 is not good: I showed
above how to generate UTF-8 from wide char literals just to be exactly
comparable to your example code, and the big difference is that I did
not have to lie to the compiler and hope for the best. Instead, the code
I presented above is well-defined. The result, for my program and for
yours (since both output UTF-8) isn't well defined though -- it
depends somewhat on the phase of the moon in Seattle, or something.

Second wrongness, it's not the only way.

I started very near the top of this thread by giving a concrete example
that worked very neatly. It gets tiresome repeating myself. But as you
could see in that example, the end programmer does not have to deal with
the dirty platform-specific details any more than with all-UTF8.

    ---

And you absolutely don't want to work with codepage 65001 in the
console: it causes batch files and 'more' and pipes etc. to fail.

But, you may ask, what about Alf's program, then, it's the same for
heaven's sake?

Well, let's check:

<example>
P:\test> chcp 1252
Active code page: 1252

P:\test> a
Blåbærsyltetøy! 日本国 кошка!

P:\test> jam
Blåbærsyltetøy! 日本国 кошка!

P:\test>
</example>

He he. :-)

It works also with the more practical codepage 1252 in the console.

The reason is probably that it uses WriteConsole internally, but it
doesn't matter much how the runtime library accomplishes this.

On the other hand, as with much else Microsoft there are probably hidden
costs.

It is possible that invoking this C level support may wreak havoc at the
C++ iostreams level, so that a good solution may have to provide custom
iostream buffers working around the Microsoft bugs.

[snip about reporting one of the myriad console bugs, to Microsoft]
> Since you cannot set UTF-16 codepage for the console, UTF-8 is your only
> options from the said above.

No, that's incorrect.

In my (limited) experience UTF-16 is more reliable for this.

However, UTF-16 as an external encoding feels sort of wrong, even if it
is very efficient for Japanese network traffic.
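
The efficiency remark is easy to check by counting bytes. A small
sketch, with hypothetical helper names utf8_bytes and utf16_bytes, that
assumes its input consists of valid Unicode code points:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers, for illustration only. Both assume valid code points.

// Number of bytes needed to encode `text` as UTF-8.
std::size_t utf8_bytes(const std::u32string& text)
{
    std::size_t n = 0;
    for (char32_t cp : text) {
        n += (cp < 0x80) ? 1 : (cp < 0x800) ? 2 : (cp < 0x10000) ? 3 : 4;
    }
    return n;
}

// Number of bytes needed to encode `text` as UTF-16 (without a BOM).
std::size_t utf16_bytes(const std::u32string& text)
{
    std::size_t n = 0;
    for (char32_t cp : text) {
        n += (cp < 0x10000) ? 2 : 4;  // surrogate pair above the BMP
    }
    return n;
}
```

For the three characters of "日本国" this gives 9 bytes as UTF-8 against
6 bytes as UTF-16, while for plain ASCII text the ratio is reversed.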

> Furthermore, if people will pester microsoft we
> will get more benefit (no pun intended) than rewriting our code to use some
> unknown encoding that is different on each platform.

I believe that could greatly ease the porting of *nix tools to Windows.

Cheers & hth.,

- Alf

