Subject: Re: [boost] Silly Boost.Locale default narrow string encoding in Windows
From: Alf P. Steinbach (alf.p.steinbach+usenet_at_[hidden])
Date: 2011-10-30 22:54:43


On 30.10.2011 18:28, Yakov Galka wrote:
> On Sat, Oct 29, 2011 at 18:21, Alf P. Steinbach wrote:
>> Upthread, Yakov Galka wrote:
>>>
>>> Somewhat better. But how do I get to see the whole string?
>>
>> Not with any single-byte-per-character encoding. ;-)
>
> That's why ANSI codepages other than UTF-8 are crap, they're not suitable
> for internationalization.

Nobody has suggested using Windows ANSI for internationalization.

So your use of the four-letter word "crap" is, so to speak, wasted.

>> UTF-8 is a bit problematic because the Windows support is really flaky.
>>
>
> It's a problem of Windows, not of UTF-8. Report it to Microsoft, demand
> UTF-8 support, and meanwhile develop workarounds that let people use UTF-8
> portably. In 20 years we may get working UTF-8 support. I understand that
> you don't give a damn about what things will be like in 20 years, but I do
> care.

Uh, a four-letter word again. I suggest reserving them for where they
suitably describe reality. E.g., I used a four-letter word once in this
discussion, namely "hell" (in "a hell of a lot more"), about the
Windows console bugs.

By the way, I can assure you that telepathy does not work: the claimed
insight into my motivations etc. is incorrect (making such a claim is
also an invalid form of rhetoric, but that's less important).

>> Still, since you're using 'wprintf', that's at the C level, so it's no
>> problem:
>
> Congratulations! You found a WORKAROUND to properly support WIDE-CHAR, when
> UTF-8 support is ALREADY THERE.

Please reserve all uppercase for macro names.

And no, I have so far not had the pleasure of learning anything
technical from this thread, unless you count the lie-to-g++ trick
applied to the Visual C++ compiler; but that's more psychological than
technical.

I think that this one-sided learning, i.e. that various aspects of
reality have apparently not been well known to Boosters, means that the
Boost review process in this case probably did not involve the right
kind of critical people knowledgeable in the domain.

> But you know what? There's a similar
> workaround to output UTF-8 when UTF-8 is not set for the console. Now
> explain, how is this:
>
> #include <fcntl.h> // _O_U8TEXT
> #include <io.h> // _setmode
> #include <stdio.h> // _fileno, wprintf
>
> int main()
> {
>     _setmode( _fileno( stdout ), _O_U8TEXT );
>     wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
> }
>
> M:\>chcp 1252
> Active code page: 1252
>
> M:\>a.exe
> Blåbærsyltetøy! 日本国 кошка!\n
>
> better than this:
>
> #include <windows.h> // SetConsoleOutputCP, CP_UTF8
> #include <stdio.h> // printf
>
> int main()
> {
>     SetConsoleOutputCP( CP_UTF8 );
>     printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
> }

The first program, with its wide string literal, does not require you to
lie to the Visual C++ compiler about the source code encoding.

The second program does require you to lie to the compiler.

Hence:

(1) wide string literals with non-ASCII characters will be mangled,
cutting off use of an otherwise well defined language feature;

(2) a later version of the compiler may be able to infer the UTF-8
encoding in spite of the lacking BOM, and then mangle the text;

(3) you have to invoke inefficient data conversions for any use of
functions that adhere to Windows conventions, which includes most
Windows libraries and of course the Windows API -- e.g., try
MessageBox, as sketched below;

(4) you force Windows programmers to deal with 3 text encodings (ANSI,
UTF-8 and UTF-16) instead of just 2 (ANSI and UTF-16); and

(5) by "overloading" the narrow character strings with two main
character encodings, you make it easy to introduce encoding-related
bugs which can only be found by laborious run-time testing.
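
To make point (3) concrete, here is a minimal illustration of my own
(not from the thread or the library under review; it assumes a Windows
toolchain with user32 linkage). The UTF-8 bytes are spelled out
explicitly so that the source code encoding plays no role:

#include <windows.h>

int main()
{
    // "Blåbær" as explicit UTF-8 bytes ("å" = C3 A5, "æ" = C3 A6);
    // the literal is split into pieces so the hex escapes don't
    // swallow the letters that follow them.
    const char* utf8 = "Bl" "\xC3\xA5" "b" "\xC3\xA6" "r";

    // MessageBoxA interprets narrow strings in the ANSI codepage, so
    // under codepage 1252 this typically displays as "BlÃ¥bÃ¦r".
    MessageBoxA( 0, utf8, "Narrow, ANSI-interpreted", MB_OK );

    // Correct display requires converting to UTF-16 first -- the
    // extra conversion step that point (3) refers to.
    wchar_t wide[64];
    MultiByteToWideChar( CP_UTF8, 0, utf8, -1, wide, 64 );
    MessageBoxW( 0, wide, L"Wide, after conversion", MB_OK );
}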

So, it is an inefficient hack that can stop working, that cuts off a
well defined language feature, forces complexity and attracts bugs.

In my opinion it is not a good idea to base a Boost library on such a
hack: one that can stop working, that cuts off a language feature much
used in Windows, and that on top of that forces complexity and attracts
bugs which can only be found by testing.

[snip]
> How will you explain to Åshild Bjørnson why he can use plain-old printf on the
> workstations in the university but he needs to use all the w and L (or your
> proprietary Unicode-wrappers) on his private computer at home?

I would not, since that's not the case.

Also, I do not have any proprietary wrappers; that too is incorrect.

> 'w' stands for windows?

AFAIK "w" has not appeared in this thread, unless you're thinking of the
standard library's wprintf etc. I do not know what it otherwise stands
for or is. Note that using wprintf or "L" literals directly is not
portable, so if that's what you're thinking of then it's a non-issue.

> Or perhaps you want to infect the non-windows world with
> wchar_t too?

I am baffled by your assumption that wchar_t is not used at all in the
*nix world.

And I am also baffled by your lack of understanding of the scheme I
have described many times. So, for the record: I have not been talking
about using a representation that is unnatural for the platform at
hand. On the contrary, I have argued for using the natural encoding for
the platform -- which to my mind is much of what C++ is all about,
namely diversity, adaptation and raw efficiency, and, instead of the
Java idea of binary-level portability, C++-style efficient (but less
convenient) source-code-level portability.

For that matter, I am also baffled by the four-letter-word attack on
Windows ANSI as a vehicle for internationalization: that is impossible
and so is not done, so you attacked a non-existent scheme there.

[snip]
> It's not lying. It's just not telling the truth.

To lie is to intentionally make someone believe something that one
thinks one knows is not true. One can lie by stating the truth. And in
this case, one lies by omitting a crucial fact (namely the BOM).

> And in C++11 you won't
> need it either:
>
> #include <windows.h> // SetConsoleOutputCP, CP_UTF8
> #include <stdio.h> // printf
>
> int main()
> {
>     SetConsoleOutputCP( CP_UTF8 );
>     printf( u8"Blåbærsyltetøy! 日本国 кошка!\n" );
> }

Yes, this is indeed a point in favor of the UTF-8 scheme: that C++11
partially supports it.

Knowledge of the encoding is, however, discarded: the end result is
just an array of 'char', which on the Windows platform is unfortunately
by convention expected to be ANSI-encoded... -> bugs.

>> Instead, the code I presented above is well-defined. The result, for my
>> program and for yours (since both output UTF-8) isn't well defined though
>> -- it depends somewhat on the phase of the moon in Seattle, or something.
>>
>
> What? It's well defined: both will write UTF-8 bytes to stdout. If you
> redirect to a file, it's well defined. If you redirect to another program,
> it's well defined. What may not be well defined is how the receiver
> interprets this. It will break only when the receiver tries to convert the
> data to UTF-16 without knowing that it's UTF-8. But then the problem is
> not restricted to UTF-8: it is the same for any 'ANSI' encoding. This
> is why standardizing on UTF-8 is important.

No, I was talking about the console window support. A console window
will itself often partially mangle UTF-8 output, in particular the first
letter. At least it has done that when I have tested out the examples
for this thread. However, supporting UTF-8 more directly, with e.g. a
SetConsoleOutputCP call, appears to work for direct presentation.

>> Second wrongness, it's not the only way.
>>
> You don't have stdin and wstdin. stdin has a byte-oriented encoding, and
> thus the only way to transfer Unicode data through it is with UTF-8. If
> you want to use wprintf -- good, the library will do the conversion for
> you. But it still has to be translated to UTF-8. If you don't use UTF-8
> you won't be Unicode-compatible. And if you're not Unicode-compatible,
> that means you're stuck in the 20th century.

I am not sure what you're arguing here. The bit about "the only way" is
technically wrong. However, I /think/ what you're trying to communicate
is that UTF-8 is good as a kind of universal external encoding.

And if so, then I wholeheartedly agree.

However, we have been discussing internal text representation.
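
The distinction can be sketched in code. This is my illustration (not
an example from the thread), using the Microsoft CRT's _O_U8TEXT mode;
that every CRT version translates input this way is an assumption I
have not verified:

#include <fcntl.h> // _O_U8TEXT
#include <io.h> // _setmode
#include <stdio.h> // _fileno, fgetws, wprintf

int main()
{
    // Externally both standard streams carry UTF-8 bytes; internally
    // the program sees the platform's natural wchar_t (UTF-16) text.
    _setmode( _fileno( stdin ), _O_U8TEXT );
    _setmode( _fileno( stdout ), _O_U8TEXT );

    wchar_t line[256];
    if( fgetws( line, 256, stdin ) != 0 )
    {
        wprintf( L"Read: %ls", line ); // re-encoded as UTF-8 on output
    }
}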

> ⚠ The importance of Unicode is not only in multilingual support; it's
> important even within one language such as English -- “ﬁﬂﬀﬃﬄ”... No
> 'ANSI' non-UTF-8 codepage can encode these.

Yes, you have words like "maneuver", which properly is spelled with an
"œ" ligature that I once mistakenly took for a Norwegian "æ"!

>> I started very near the top of this thread by giving a concrete example
>> that worked very neatly. It gets tiresome repeating myself. But as you
>> could see in that example, the end programmer does not have to deal with
>> the dirty platform-specific details any more than with all-UTF8.
>
> She does. She needs to use your redundant u::sprintf when the
> narrow-character STANDARD sprintf works just fine.

Oh, the standard sprintf starts yielding incorrect results as soon as
some ANSI text has sneaked into the mix, or when Visual C++ 12 (say) has
discovered that your BOM-less source code is UTF-8 encoded.

With something like u::sprintf one is to some extent protected by having
the encoding statically type-checked.

You can say that with C++ compared to C, more and stronger static type
checking is a large part of what C++ is all about. ;-)
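
For illustration only (u::sprintf itself is not shown in this excerpt,
so the names below are hypothetical): the idea is that a distinct type
carries the encoding, so that passing raw narrow text of unknown
encoding simply fails to compile.

#include <cstddef> // std::size_t
#include <cwchar> // std::swprintf
#include <string>

namespace u {
    // Text known to be in the platform's natural encoding
    // (UTF-16 wchar_t on Windows).
    class native_text
    {
    public:
        explicit native_text( wchar_t const* s ): s_( s ) {}
        wchar_t const* c_str() const { return s_.c_str(); }
    private:
        std::wstring s_;
    };

    // Simplified, non-variadic stand-in for a type-checked sprintf.
    inline int sprintf( wchar_t* buf, std::size_t n, native_text const& text )
    {
        return std::swprintf( buf, n, L"%ls", text.c_str() );
    }
}

int main()
{
    wchar_t buf[80];
    u::sprintf( buf, 80, u::native_text( L"Blåbærsyltetøy!" ) );
    //u::sprintf( buf, 80, "raw narrow bytes" ); // error: encoding unknown
}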

>> It works also with the more practical codepage 1252 in the console.
>
> My default is not 1252. Stop being Euro-centric. UTF-8 works with 1252 too
> as shown above.
>
> [...]
>> In my (limited) experience UTF-16 is more reliable for this.
>
> How is it more reliable?

A console window will in some cases mangle the first character of UTF-8
output; I don't know why. And the [cmd.exe] "/u" option for supporting
Unicode in pipes reportedly uses UTF-16 (disclaimer: I haven't used it).
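
For reference, and with the same disclaimer of being untested by me,
usage would presumably look like

  cmd /u /c "type input.txt > output.txt"

with output.txt then containing UTF-16 encoded text, since "/u"
reportedly makes cmd's internal commands write Unicode to pipes and
files.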

Cheers & hth.,

- Alf

PS: Sorry that I don't have time to answer all responses.

