Subject: Re: [boost] Silly Boost.Locale default narrow string encoding in Windows
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2011-10-29 08:14:29

On Fri, Oct 28, 2011 at 17:47, Alf P. Steinbach <
alf.p.steinbach+usenet_at_[hidden]> wrote:

> On 28.10.2011 15:00, Peter Dimov wrote:
>> Yakov Galka wrote:
>>> Personally I'm happy with
>>> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>>> writing UTF-8. Even if I cannot configure the console, I still can
>>> redirect
>>> it to a file, and it will correctly save this as UTF-8.
>> You can configure the console. Select Consolas or Lucida Console as the
>> font, then issue chcp 65001. chcp 65001 apparently breaks .bat files
>> though. :-)
> it breaks a hell of a lot more than batch files. try `more`.
> cheers & hth.,
> - Alf
So I tried to make YOUR approach work (i.e. use wchar_t):

Created a file with:

#include <cstdio>
#include <cwchar>
int main() {
    ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

Saved as UTF-8 with BOM. Compiled with VS2005 on Windows XP.

M:\bin> a.exe

M:\bin> a.exe > a.txt
Contents of a.txt:
42 6C E5 62 E6 72 73 79 6C 74 65 74 F8 79 21 20

What happens to Japanese and Russian? What's the mojibake? Maybe the
compiler corrupted the string? Let's see, change to:

    wchar_t s[] = L"Blåbærsyltetøy! 日本国 кошка!\n";
    ::wprintf( s );

Recompile, step into the debugger. No. It's your favorite, correct UTF-16
that's passed to wprintf. Same result. Let's try a European codepage:

M:\bin> chcp 1252
M:\bin> a.exe

Somewhat better. But how do I get to see the whole string?

M:\bin> chcp 65001
M:\bin> a.exe

M:\bin> chcp 1200
Invalid code page

OK, let's drop the requirement that the user sees the string at all. Let's
restrict to a simpler case: a.exe writes unicode to stdout, b.exe reads it
from stdin and writes verbatim to a file. Here is program b.exe:

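#include <cstdio>
#include <cwchar>
#include <fstream>
int main() {
    wchar_t s[256];
    if (!fgetws(s, 256, stdin)) return 1; // read one line of wide text from stdin
    std::ofstream fout("out.txt", std::ios::binary);
    fout.write((const char*)s, 2*wcslen(s)); // raw bytes, to see what really arrived
}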

Compile, run.

M:\a> a.exe | b.exe

Independent of chcp I get:

42 00 6C 00 E5 00 62 00 E6 00 72 00 73 00 79 00 6C 00 74 00 65 00 74 00 F8
00 79 00 21 00 20 00

Why the hell is this lossy‽ Where IS my lovely Japanese? What am I doing wrong?

Ah! it's IMPOSSIBLE with wprintf!

Let's try UTF-8 instead. Write the program as we've written it for 40 years,
even before UTF-8 and the whole wide-char crap was introduced†.

Open VS2005:

#include <stdio.h>
int main() {
    printf("Blåbærsyltetøy! 日本国 кошка!\n");
}

† I mean the C functions used. Of course we couldn't mix Japanese and
Russian back then.

Save in UTF-8 WITHOUT BOM. Compile to a-utf8.exe.

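#include <cstring>
#include <fstream>
#include <iostream>
int main() {
    char s[256];
    std::cin.getline(s, sizeof s); // read one line from stdin, newline dropped
    std::ofstream fout("out.txt", std::ios::binary);
    fout.write((const char*)s, strlen(s));
}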

Compile to b-utf8.exe.

M:\> a-utf8.exe
Blåbærsyltetøy! 日本国 кошка!

Something is bad. [The user goes to the documentation/support: "Alright, I
need UTF-8. This software is Unicode aware! Good, they care."]

M:\> chcp 65001
M:\> a-utf8.exe
Blåbærsyltetøy! 日本国 кошка!

Correct! (Ok, I see squares for the Japanese because I don't have a
monospace font for it, but copy/paste works correctly.)

M:\> a-utf8.exe > a.txt

42 6C C3 A5 62 C3 A6 72 73 79 6C 74 65 74 C3 B8 79 21 20 E6 97 A5 E6 9C AC
E5 9B BD 20 D0 BA D0 BE D1 88 D0 BA D0 B0 21 0D 0A


M:\> a-utf8.exe | b-utf8.exe
M:\> type out.txt
Blåbærsyltetøy! 日本国 кошка!

42 6C C3 A5 62 C3 A6 72 73 79 6C 74 65 74 C3 B8 79 21 20 E6 97 A5 E6 9C AC
E5 9B BD 20 D0 BA D0 BE D1 88 D0 BA D0 B0 21

It works! MAGIC! More importantly: ***It's the only way to make it work!***

⇒ What if it's automatic and the user cannot intervene to change the
codepage?
‽ If it's automatic, then you don't care how it's displayed in the console.
You will log it to a file anyway. The case of:
M:\> a-utf8.exe | b-utf8.exe
works correctly regardless of what the current codepage is set to.

⇒ more doesn't work.
‽ Report the bug to Microsoft. UTF-8 is a documented codepage. Microsoft
itself encourages using either UTF-8 or UTF-16. Other 'ANSI' codepages are
unofficially deprecated.

    Note ANSI code pages can be different on different computers, or can
be changed for a single computer, leading to data corruption. For the most
consistent results, applications should use Unicode, such as UTF-8 or
UTF-16, instead of a specific code page.

Since you cannot set a UTF-16 codepage for the console, UTF-8 is your only
option of the two. Furthermore, if people pester Microsoft we will get more
benefit (no pun intended) than from rewriting our code to use some unknown
encoding that is different on each platform.

