Subject: Re: [boost] Silly Boost.Locale default narrow string encoding in Windows
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2011-10-29 08:14:29

Next message: Thijs (M.A.) van den Berg: "Re: [boost] [math][distributions] superfluous checking of parameters?"
Previous message: Ion Gaztañaga: "Re: [boost] [interprocess] native Windows cond_var + mutex"
In reply to: Alf P. Steinbach: "Re: [boost] Silly Boost.Locale default narrow string encoding in Windows"
Next in thread: Alf P. Steinbach: "Re: [boost] Silly Boost.Locale default narrow string encoding in Windows"
Reply: Alf P. Steinbach: "Re: [boost] Silly Boost.Locale default narrow string encoding in Windows"

On Fri, Oct 28, 2011 at 17:47, Alf P. Steinbach <
alf.p.steinbach+usenet_at_[hidden]> wrote:

> On 28.10.2011 15:00, Peter Dimov wrote:
>
>> Yakov Galka wrote:
>>
>>> Personally I'm happy with
>>>
>>> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>>>
>>> writing UTF-8. Even if I cannot configure the console, I still can
>>> redirect
>>> it to a file, and it will correctly save this as UTF-8.
>>>
>>
>> You can configure the console. Select Consolas or Lucida Console as the
>> font, then issue chcp 65001. chcp 65001 apparently breaks .bat files
>> though. :-)
>>
>
> it breaks a hell of a lot more than batch files. try `more`.
>
> cheers & hth.,
>
> - Alf
>
>
So I tried to make YOUR approach work (i.e. use wchar_t):

Created a file with:

#include <cstdio>
#include <cwchar>  // wprintf
int main() {
    ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

Saved as UTF-8 with BOM. Compiled with VS2005 on Windows XP.

M:\bin> a.exe
Blσbµrsyltet°y!

M:\bin> a.exe > a.txt
Contents of a.txt:
42 6C E5 62 E6 72 73 79 6C 74 65 74 F8 79 21 20

What happens to Japanese and Russian? What's the mojibake? Maybe the
compiler corrupted the string? Let's see, change to:

wchar_t s[] = L"Blåbærsyltetøy! 日本国 кошка!\n";
::wprintf( s );

Recompile, step into the debugger. No. It's your favorite, correct UTF-16
that's passed to wprintf. Same result. Let's try a European codepage:

M:\bin> chcp 1252
M:\bin> a.exe
Blåbærsyltetøy!

Somewhat better. But how do I get to see the whole string?

M:\bin> chcp 65001
M:\bin> a.exe
Blbrsyltety!

M:\bin> chcp 1200
Invalid code page

OK, let's drop the requirement that the user sees the string at all. Let's
restrict to a simpler case: a.exe writes unicode to stdout, b.exe reads it
from stdin and writes verbatim to a file. Here is program b.exe:

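#include <cstdio>    // _getws (Microsoft CRT)
#include <cwchar>    // wcslen
#include <fstream>
int main() {
    wchar_t s[256];
    _getws(s);  // as tested; MSVC-specific and unbounded

    std::ofstream fout("out.txt", std::ios::binary);
    // I want to see what I really get
    fout.write((const char*)s, sizeof(wchar_t)*wcslen(s));
}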

Compile, run.

M:\a> a.exe | b.exe

Independent of chcp I get:

42 00 6C 00 E5 00 62 00 E6 00 72 00 73 00 79 00 6C 00 74 00 65 00 74 00 F8
00 79 00 21 00 20 00

Why the hell is this lossy‽ Where IS my lovely Japanese? What am I doing
wrong⸘

Ah! it's IMPOSSIBLE with wprintf!

Let's try UTF-8 instead. Write the program as we've written it for 40 years,
even before UTF-8 and the whole wide-char crap was introduced†.

Open VS2005:

#include <stdio.h>
int main() {
    printf("Blåbærsyltetøy! 日本国 кошка!\n");
}

† I mean the C functions used. Of course we couldn't mix Japanese and
Russian back then.

Save in UTF-8 WITHOUT BOM. Compile to a-utf8.exe.

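#include <cstdio>    // gets
#include <cstring>   // strlen
#include <fstream>
int main() {
    char s[256];
    gets(s);  // as tested with VS2005; unbounded, and removed in C11/C++14
    std::ofstream fout("out.txt", std::ios::binary);
    fout.write((const char*)s, strlen(s));
}

Compile to b-utf8.exe: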

M:\> a-utf8.exe
BlÃ¥bÃ¦rsyltetÃ¸y! æ—¥æœ¬å›½ ÐºÐ¾ÑˆÐºÐ°!

Something is bad. [The user goes to the documentation/support. Alright, I
need UTF-8. This software is Unicode aware! Good, they care about their
customers!]:

M:\> chcp 65001
M:\> a-utf8.exe
Blåbærsyltetøy! 日本国 кошка!

Correct! (Ok, I see squares for the Japanese because I don't have a
monospace font for it, but copy/paste works correctly.)

M:\> a-utf8.exe > a.txt

a.txt:
42 6C C3 A5 62 C3 A6 72 73 79 6C 74 65 74 C3 B8 79 21 20 E6 97 A5 E6 9C AC
E5 9B BD 20 D0 BA D0 BE D1 88 D0 BA D0 B0 21 0D 0A

Correct!

M:\> a-utf8.exe | b-utf8.exe
M:\> type out.txt
Blåbærsyltetøy! 日本国 кошка!

out.txt:
42 6C C3 A5 62 C3 A6 72 73 79 6C 74 65 74 C3 B8 79 21 20 E6 97 A5 E6 9C AC
E5 9B BD 20 D0 BA D0 BE D1 88 D0 BA D0 B0 21

It works! MAGIC! More importantly: ***It's the only way to make it work!***

⇒ What if it's automatic and the user cannot intervene to change the
codepage?
‽ If it's automatic, then you don't care how it's displayed in the console.
You will log it to a file anyway. The case of:
M:\> a-utf8.exe | b-utf8.exe
works correctly regardless of the current codepage.

⟹ more doesn't work.
‽ Report the bug to Microsoft. UTF-8 is a documented codepage. Microsoft
itself encourages using either UTF-8 or UTF-16. Other 'ANSI' codepages are
unofficially deprecated.
http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%29.aspx says:

Note ANSI code pages can be different on different computers, or can
be changed for a single computer, leading to data corruption. For the most
consistent results, applications should use Unicode, such as UTF-8 or
UTF-16, instead of a specific code page.

Since you cannot set a UTF-16 codepage for the console, UTF-8 is your only
option of the two mentioned above. Furthermore, if people pester Microsoft, we
will get more benefit (no pun intended) than from rewriting our code to use some
unknown encoding that is different on each platform.

-- 
Yakov

