Subject: Re: [boost] Silly Boost.Locale default narrow string encoding in Windows
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2011-10-30 13:28:23


On Sat, Oct 29, 2011 at 18:21, Alf P. Steinbach <
alf.p.steinbach+usenet_at_[hidden]> wrote:

> [...]
>
>>
>> M:\bin> chcp 1252
>> M:\bin> a.exe
>> Blåbærsyltetøy!
>>
>> Somewhat better. But how do I get to see the whole string?
>>
>
> Not with any single-byte-per-character encoding. ;-)
>

That's why ANSI codepages other than UTF-8 are crap, they're not suitable
for internationalization.

> UTF-8 is a bit problematic because the Windows support is really flaky.
>

It's a problem of Windows, not UTF-8. Report it to Microsoft, demand UTF-8
support, and meanwhile develop workarounds that let people use UTF-8 portably.
In 20 years we may get working UTF-8 support. I understand that you don't
give a damn about what will be in 20 years, but I do care.

> Still, since you're using 'wprintf', that's at the C level, so it's no
> problem:
>

Congratulations! You found a WORKAROUND to properly support WIDE-CHAR, when
UTF-8 support is ALREADY THERE. But you know what? There's a similar
workaround to output UTF-8 when UTF-8 is not set for the console. Now
explain, how is this:

#include <stdio.h>
#include <wchar.h>
#include <io.h>
#include <fcntl.h>
int main()
{
   _setmode( _fileno( stdout ), _O_U8TEXT );
   wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

M:\>chcp 1252
Active code page: 1252

M:\>a.exe
Blåbærsyltetøy! 日本国 кошка!\n

better than this:

#include <stdio.h>
#include <windows.h>
int main()
{
   SetConsoleOutputCP(CP_UTF8);
   printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}

M:\>chcp 1252
Active code page: 1252

M:\>a.exe
Blåbærsyltetøy! 日本国 кошка!\n

‽

How will you explain to Åshild Bjørnson why he can use plain-old printf on the
workstations in the university but he needs to use all the w and L (or your
proprietary Unicode-wrappers) on his private computer at home? 'w' stands
for windows? Or perhaps you want to infect the non-windows world with
wchar_t too?

> They seem to be connected with timing or something. So UTF-8 is not good: I
> showed above how to generate UTF-8 from wide char literals just to be
> exactly comparable to your example code,

I showed you how you can continue to use UTF-8, resulting in portable code
(modulo a call to SetConsoleOutputCP) which behaves the same as yours.

> and the big difference is that I did not have to lie to the compiler and
> hope for the best.

It's not lying. It's just not telling the truth. And in C++11 you won't
need it either:

#include <stdio.h>
#include <windows.h>
int main()
{
   SetConsoleOutputCP(CP_UTF8);
   printf( u8"Blåbærsyltetøy! 日本国 кошка!\n" );
}

> Instead, the code I presented above is well-defined. The result, for my
> program and for yours (since both output UTF-8) isn't well defined though
> -- it depends somewhat on the phase of the moon in Seattle, or something.
>

What? It's well defined: both will write UTF-8 bytes to stdout. If you
redirect to a file, it's well defined. If you redirect to another program,
it's well defined. What may not be well defined is how the receiver
interprets this. It will break only when the receiver tries to convert the
data to UTF-16 if it doesn't know that it's UTF-8. But then it's again not
restricted to UTF-8. The problem is the same for any 'ANSI' encoding. This
is why standardizing on UTF-8 is important.

> Second wrongness, it's not the only way.
>

You don't have stdin and wstdin. stdin has a byte-oriented encoding and thus
the only way to transfer unicode data through it is with UTF-8. If you want
to use wprintf—good, the library will do the conversion for you. But it
still has to be translated to UTF-8. If you don't use UTF-8 you won't be
Unicode-compatible. If you're not Unicode compatible, that means you're
stuck in the 20th century.

⚠ The importance of Unicode is not only in multilingual support, it's
important even within one language such as English: “ﬁﬂﬀﬃﬄ”… No 'ANSI'
non-UTF-8 codepage can encode these.

> I started very near the top of this thread by giving a concrete example
> that worked very neatly. It gets tiresome repeating myself. But as you
> could see in that example, the end programmer does not have to deal with
> the dirty platform-specific details any more than with all-UTF8.
>

She does. She needs to use your redundant u::sprintf when the
narrow-character STANDARD sprintf works just fine.

> It works also with the more practical codepage 1252 in the console.
>

My default is not 1252. Stop being Euro-centric. UTF-8 works with 1252 too
as shown above.

[...]
> In my (limited) experience UTF-16 is more reliable for this.
>

How is it more reliable?

-- 
Yakov
