Subject: Re: [boost] Silly Boost.Locale default narrow string encoding in Windows
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2011-10-29 08:14:29


On Fri, Oct 28, 2011 at 17:47, Alf P. Steinbach <
alf.p.steinbach+usenet_at_[hidden]> wrote:

> On 28.10.2011 15:00, Peter Dimov wrote:
>
>> Yakov Galka wrote:
>>
>>> Personally I'm happy with
>>>
>>> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>>>
>>> writing UTF-8. Even if I cannot configure the console, I still can
>>> redirect
>>> it to a file, and it will correctly save this as UTF-8.
>>>
>>
>> You can configure the console. Select Consolas or Lucida Console as the
>> font, then issue chcp 65001. chcp 65001 apparently breaks .bat files
>> though. :-)
>>
>
> it break a hell of a lot more than batch files. try `more`.
>
> cheers & hth.,
>
> - Alf
>
>
So I tried to make YOUR approach work (i.e. use wchar_t):

Created a file with:

#include <cstdio>
#include <cwchar>   // declares wprintf
int main() {
    ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

Saved as UTF-8 with BOM. Compiled with VS2005 on Windows XP.

M:\bin> a.exe
Blσbµrsyltet°y!

M:\bin> a.exe > a.txt
Contents of a.txt:
42 6C E5 62 E6 72 73 79 6C 74 65 74 F8 79 21 20

What happens to Japanese and Russian? What's the mojibake? Maybe the
compiler corrupted the string? Let's see, change to:

    wchar_t s[] = L"Blåbærsyltetøy! 日本国 кошка!\n";
    ::wprintf( s );

Recompile, step into the debugger. No. It's your favorite, correct UTF-16
that's passed to wprintf. Same result. Let's try a European codepage:

M:\bin> chcp 1252
M:\bin> a.exe
Blåbærsyltetøy!

Somewhat better. But how do I get to see the whole string?

M:\bin> chcp 65001
M:\bin> a.exe
Blbrsyltety!

M:\bin> chcp 1200
Invalid code page

OK, let's drop the requirement that the user sees the string at all. Let's
restrict to a simpler case: a.exe writes unicode to stdout, b.exe reads it
from stdin and writes verbatim to a file. Here is program b.exe:

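#include <cstdio>
#include <cwchar>
#include <fstream>

int main() {
    wchar_t s[256];
    _getws(s); // MSVC-specific and unbounded; kept as in the original test

    std::ofstream fout("out.txt", std::ios::binary);
    // Dump the raw wchar_t bytes to see what was really received.
    fout.write((const char*)s, sizeof(wchar_t) * wcslen(s));
}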

Compile, run.

M:\a> a.exe | b.exe

Independent of chcp I get:

42 00 6C 00 E5 00 62 00 E6 00 72 00 73 00 79 00 6C 00 74 00 65 00 74 00 F8
00 79 00 21 00 20 00

Why the hell is this lossy‽ Where IS my lovely Japanese? What am I doing
wrong⸘

Ah! It's IMPOSSIBLE with wprintf!
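
A possible explanation for the truncated dumps above, offered as an assumption and not checked against the VS2005 CRT sources: in text mode, wprintf narrows each wchar_t through the default "C" locale, where code points up to 0xFF map to a single byte and anything above that fails and ends the output. The helper below (the name narrow_like_c_locale is hypothetical) only simulates that assumed rule; it is not CRT code.

```cpp
#include <string>

// Hypothetical simulation of the assumed "C" locale narrowing rule:
// code points up to 0xFF become one byte, and the first code point
// above 0xFF stops the conversion (everything after it is lost).
std::string narrow_like_c_locale(const std::wstring& wide) {
    std::string out;
    for (wchar_t wc : wide) {
        unsigned long cp = static_cast<unsigned long>(wc);
        if (cp > 0xFF) {
            break; // conversion failure: output ends here
        }
        out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return out;
}
```

Fed the test string, this simulation yields the same 16 bytes as the a.txt dump (42 6C E5 62 ... 21 20): the Latin-1 letters survive and everything from the first Japanese character onward is dropped.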

Let's try UTF-8 instead. Write the program as we've written it for 40 years,
even before UTF-8 and the whole wide-char crap was introduced†.

Open VS2005:

#include <stdio.h>
int main() {
    printf("Blåbærsyltetøy! 日本国 кошка!\n");
}

† I mean the C functions used. Of course we couldn't mix Japanese and
Russian back then.

Save in UTF-8 WITHOUT BOM. Compile to a-utf8.exe.

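#include <cstdio>
#include <cstring>
#include <fstream>

int main() {
    char s[256];
    gets(s); // unbounded read; kept as in the original VS2005 test
    std::ofstream fout("out.txt", std::ios::binary);
    fout.write(s, strlen(s)); // write the received bytes verbatim
}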

Compile to b-utf8.exe:

M:\> a-utf8.exe
Blåbærsyltetøy! 日本国 кошка!

Something is bad. [The user goes to the documentation/support. Alright, I
need UTF-8. This software is Unicode aware! Good, they care about their
customers!]:

M:\> chcp 65001
M:\> a-utf8.exe
Blåbærsyltetøy! 日本国 кошка!

Correct! (Ok, I see squares for the Japanese because I don't have a
monospace font for it, but copy/paste works correctly.)
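
The same codepage switch can also be requested from inside the program with the Win32 call SetConsoleOutputCP. This was not tried in the experiment above, so treat it as a sketch: the helper name use_utf8_console is hypothetical, and whether the console then renders every glyph still depends on the font and the Windows version.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: ask the attached console to interpret the program's
// output as UTF-8 (codepage 65001), like running "chcp 65001" beforehand.
// Returns true on success. On non-Windows builds it does nothing and
// returns true, on the assumption that the terminal already expects UTF-8.
bool use_utf8_console() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}
```

Calling use_utf8_console() at the top of main, before the printf, would be the intended use.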

M:\> a-utf8.exe > a.txt

a.txt:
42 6C C3 A5 62 C3 A6 72 73 79 6C 74 65 74 C3 B8 79 21 20 E6 97 A5 E6 9C AC
E5 9B BD 20 D0 BA D0 BE D1 88 D0 BA D0 B0 21 0D 0A

Correct!
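
To check a dump like this by hand, the following sketch encodes code points as UTF-8 and prints the bytes in the same hex layout. The helper names append_utf8, to_utf8 and to_hex are made up for illustration, and surrogate code points are not rejected.

```cpp
#include <cstddef>
#include <string>

// Append the UTF-8 encoding of one code point (U+0000..U+10FFFF).
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Encode a whole UTF-32 string as UTF-8.
std::string to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        append_utf8(out, cp);
    }
    return out;
}

// Format bytes as space-separated uppercase hex, e.g. "42 6C C3 A5".
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out.push_back(' ');
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0F]);
    }
    return out;
}
```

Applied to the test string without its newline, to_hex(to_utf8(...)) gives the bytes of a.txt up to the final 21. The trailing 0D 0A in the file is the CRT's text-mode translation of '\n', which this sketch does not model.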

M:\> a-utf8.exe | b-utf8.exe
M:\> type out.txt
Blåbærsyltetøy! 日本国 кошка!

out.txt:
42 6C C3 A5 62 C3 A6 72 73 79 6C 74 65 74 C3 B8 79 21 20 E6 97 A5 E6 9C AC
E5 9B BD 20 D0 BA D0 BE D1 88 D0 BA D0 B0 21

It works! MAGIC! More importantly: ***It's the only way to make it work!***

⇒ What if it's automatic and the user cannot intervene to change the
codepage?
‽ If it's automatic, then you don't care how it's displayed in the console.
You will log it to a file anyway. The case of:
M:\> a-utf8.exe | b-utf8.exe
works correctly regardless of which codepage is currently set.

⟹ more doesn't work.
‽ Report the bug to Microsoft. UTF-8 is a documented codepage. Microsoft
itself encourages using either UTF-8 or UTF-16. Other 'ANSI' codepages are
unofficially deprecated.
http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%29.aspx says:

    Note ANSI code pages can be different on different computers, or can
be changed for a single computer, leading to data corruption. For the most
consistent results, applications should use Unicode, such as UTF-8 or
UTF-16, instead of a specific code page.

Since you cannot set a UTF-16 codepage for the console, UTF-8 is your only
option of the two mentioned above. Furthermore, if people pester Microsoft, we
will get more benefit (no pun intended) than from rewriting our code to use some
unknown encoding that is different on each platform.

-- 
Yakov
