Subject: Re: [boost] Silly Boost.Locale default narrow string encoding in Windows
From: Alf P. Steinbach (alf.p.steinbach+usenet_at_[hidden])
Date: 2011-10-28 09:34:37


On 28.10.2011 13:31, Yakov Galka wrote:
> On Fri, Oct 28, 2011 at 13:17, Alf P. Steinbach<
> alf.p.steinbach+usenet_at_[hidden]> wrote:
>
>> On 28.10.2011 12:36, Yakov Galka wrote:
>>
>>> On Fri, Oct 28, 2011 at 04:23, Alf P. Steinbach<
>>> alf.p.steinbach+usenet_at_[hidden]> wrote:
>>>
>>> On 27.10.2011 23:56, Peter Dimov wrote:
>>>>
>>>>>
>>>>> The advantage of using UTF-8 is that, apart from the border layer that
>>>>> calls the OS (and that needs to be ported either way), the rest of the
>>>>> code is happily char[]-based.
>>>>
>>>> Oh.
>>>>
>>>> I would be happy to learn this.
>>>>
>>>> How do I make the following program work with Visual C++ in Windows,
>>>> using
>>>> narrow character string?
>>>>
>>>>
>>>> <code>
>>>> #include<stdio.h>
>>>> #include<fcntl.h> // _O_U8TEXT
>>>> #include<io.h> // _setmode, _fileno
>>>> #include<windows.h>
>>>>
>>>> int main()
>>>> {
>>>> //SetConsoleOutputCP( 65001 );
>>>> //_setmode( _fileno( stdout ), _O_U8TEXT );
>>>> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>>>> }
>>>> </code>
>>>>
>>>>
>>> How will you make this program portable?
>>
>> Well, that was *my* question.
>>
>> The claim that this minimal "Hello, world!" program puts to the point, is
>> that "the rest of the [UTF-8 based] code is happily char[]-based".
>>
>> Apparently that is not so.
>
> My point is that you cannot talk about things without comparison.

I think that means that I failed to communicate to you what I compared.

There was a claim that the UTF-8 based code should just work, but the
minimal hello world like code in my example does /not/ work.

Thus, it is a comparison between (1) reality, and (2) the claim, OK?

>> The out-commented code is from my random efforts to Make It Work(TM).
>>>
>>>>
>>>> It refused.
>>>>
>>>>
>>> This is because windows narrow-chars can't be UTF-8. You could make it
>>> portable by:
>>>
>>> int main()
>>> {
>>> boost::printf("Blåbærsyltetøy! 日本国 кошка!\n");
>>> }
>>>
>>
>> Thanks, TIL boost::printf.
>>
>> The idea of UTF-8 as a universal encoding seems now to be to use some
>> workaround such as boost::printf for each and every case where it turns out
>> that it doesn't work portably.
>>
>
> You pull things out of context. We should COMPARE the UTF-8 approach to the
> wide-char on windows narrow-char on non-windows approach. Your approach
> involves using your own printf just as well:
>
> #include "u/stdio_h.h" // u::CodingValue, u::printf, U
> printf(U("Blåbærsyltetøy! 日本国 кошка!\n")); // ADL?
> u::printf(U("Blåbærsyltetøy! 日本国 кошка!\n")); // or not ADL? depends on what
> exactly U is.

The relevant difference is in my opinion between

* re-implementing
   e.g. the standard library to support UTF-8 (like boost::printf, and
   although I haven't tested the claim that it works for the program we
   discussed, it is enough for me that it /could/ work), or

* wrapping
   it with some constant time data conversions (e.g. u::printf).

The hello world program demonstrated that one or the other is necessary.

So, we can forget the earlier silly claim that UTF-8 just magically
works, and now really compare, for a simplest relevant program.

And yes, with the functionality that I sketched and coded up a demo of,
you get strong type checking and argument dependent lookup. It is
however possible to design this in e.g. C level ways where it would be
much less convenient. I think opinions in the community may have been
influenced by one particularly bad such design, the [tchar.h]... ;-)
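
A minimal sketch of the type checking and ADL point, under stated assumptions: the `CodingValue` below is a hypothetical stand-in (the real type from the demo is not shown in this message), it wraps a narrow `char` so that the sketch compiles anywhere, and `length` and `demoLength` are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>

namespace u_sketch {
    // Hypothetical stand-in for a strongly typed encoding unit.
    // Assumption: one char per unit (a narrow-character platform).
    struct CodingValue { char value; };

    // Counts the units before the terminating zero unit.
    inline std::size_t length( CodingValue const* s )
    {
        assert( s != nullptr );
        std::size_t n = 0;
        while( s[n].value != '\0' ) { ++n; }
        return n;
    }
}

inline std::size_t demoLength()
{
    static u_sketch::CodingValue const text[] = { {'h'}, {'i'}, {'\0'} };
    // Unqualified call: u_sketch::length is found by argument dependent
    // lookup, because the argument's element type lives in u_sketch.
    // A call like length( "hi" ) would not compile, since a char const*
    // is not a CodingValue const*. That is the strong type checking.
    return length( text );
}
```

Here `demoLength()` yields 2 without any `u_sketch::` qualification at the call site, which is roughly the convenience that a plain macro-and-typedef design like [tchar.h] cannot offer.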

For a UTF-16 platform a printf wrapper can simply be like this:

     inline int printf( CodingValue const* format, ... )
     {
         va_list args;
         va_start( args, format );
         int const result = ::vwprintf( format->rawPtr(), args );
         va_end( args );     // Each va_start needs a matching va_end.
         return result;
     }

The sprintf wrapper that I used in my example is more interesting, though:

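     // Counted version, same signature shape as standard swprintf.
     inline int sprintf(
         CodingValue* buffer, size_t count, CodingValue const* format, ...
         )
     {
         va_list args;
         va_start( args, format );
         int const result = ::vswprintf(
             buffer->rawPtr(), count, format->rawPtr(), args
             );
         va_end( args );
         return result;
     }

     // Uncounted version, same signature shape as standard sprintf.
     inline int sprintf( CodingValue* buffer, CodingValue const* format, ... )
     {
         va_list args;
         va_start( args, format );
         int const result = ::vswprintf(
             buffer->rawPtr(), size_t( -1 ), format->rawPtr(), args
             );
         va_end( args );
         return result;
     }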

The problem that the above solves is that standard vswprintf is not a
simple wchar_t version of standard vsprintf: it takes an extra buffer
size argument. As I recall Microsoft's [tchar.h] relies on a
compiler-specific overload, but that approach does not cut it for
platform independent code. For wchar_t/char independent code, one
solution (as above) is to offer both signatures.
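
The signature difference can be tried out with a sketch that compiles on any platform with standard `vswprintf`. It is a simplification under stated assumptions: plain `wchar_t` is used instead of `CodingValue`, the namespace `u_sketch` is hypothetical, and the uncounted overload deduces the size of an array buffer rather than passing `size_t( -1 )`, which is a swapped-in variation and not what the wrappers above do.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cwchar>

namespace u_sketch {
    // Counted form: maps directly onto standard vswprintf.
    inline int sprintf(
        wchar_t* buffer, std::size_t count, wchar_t const* format, ...
        )
    {
        assert( buffer != nullptr && format != nullptr );
        va_list args;
        va_start( args, format );
        int const result = std::vswprintf( buffer, count, format, args );
        va_end( args );
        return result;
    }

    // Uncounted form for array buffers: the size N is deduced, so the
    // call site looks like narrow sprintf but still has a real limit.
    template< std::size_t N >
    inline int sprintf( wchar_t (&buffer)[N], wchar_t const* format, ... )
    {
        assert( format != nullptr );
        va_list args;
        va_start( args, format );
        int const result = std::vswprintf( buffer, N, format, args );
        va_end( args );
        return result;
    }
}
```

With a `wchar_t buffer[80]`, both `u_sketch::sprintf( buffer, 80, L"The answer is %d!", 6*7 )` and `u_sketch::sprintf( buffer, L"The answer is %d!", 6*7 )` resolve without ambiguity, because an integer count cannot convert to a format pointer and vice versa.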

Note that these wrappers do not (and do not have to) do data conversion.

Whereas re-implementations for the UTF-8 scheme have to convert data.
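
To make that conversion cost concrete, here is a minimal sketch of the kind of work a UTF-8 based replacement has to do before it can call a UTF-16 API. The name `utf16From` and the namespace are hypothetical, and the sketch assumes well-formed UTF-8 input; a production version would also have to detect and handle malformed sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace u_sketch {
    // Converts UTF-8 to UTF-16, one pass over all the input bytes.
    // Assumption: the input is well-formed UTF-8.
    inline std::u16string utf16From( std::string const& utf8 )
    {
        std::u16string result;
        std::size_t i = 0;
        while( i < utf8.size() )
        {
            unsigned const lead = static_cast<unsigned char>( utf8[i] );
            // The lead byte determines the sequence length, 1 to 4 bytes.
            unsigned const n =
                lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            assert( i + n <= utf8.size() );

            // Payload bits of the lead byte, then 6 bits per follower.
            char32_t cp = (n == 1 ? lead : lead & (0xFFu >> (n + 1)));
            for( unsigned k = 1; k < n; ++k )
            {
                unsigned const next = static_cast<unsigned char>( utf8[i + k] );
                cp = (cp << 6) | (next & 0x3F);
            }
            i += n;

            if( cp < 0x10000 )
            {
                result += static_cast<char16_t>( cp );
            }
            else    // Outside the BMP: encode as a surrogate pair.
            {
                cp -= 0x10000;
                result += static_cast<char16_t>( 0xD800 + (cp >> 10) );
                result += static_cast<char16_t>( 0xDC00 + (cp & 0x3FF) );
            }
        }
        return result;
    }
}
```

Every call that crosses into a UTF-16 API pays for a pass like this plus a buffer allocation, and a matching reverse conversion is needed for every string that comes back.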

> but anyway you have to do O(N) work to wrap the N library functions you use.

Not quite.

It is so for the UTF-8 scheme for platform independent things such as
standard library i/o, and it is so also for the native string scheme for
platform independent things such as standard library i/o.

But when you're talking about the OS API, then with the UTF-8 scheme you
need inefficient string data conversions and N wrappers, while with the
native string scheme no string data conversions and no wrappers are
needed. Only simple "get raw pointer" calls are needed, as illustrated
in my example. Those calls could even be made implicit, but I think it's
best to have them explicit in order to avoid unexpected effects.

This difference in conversion & wrapping effort was the reason that I
used both the standard library and the OS API in my original example.

The standard library call used a thin wrapper, as shown above, while the
OS API function (MessageBoxW) could be and was called directly.

> Your approach is no way better.

I hope to convince you that the native string approach is objectively
better for portable code, for any reasonable criteria, e.g.:

* Native encoded strings avoid the inefficient string data conversions
of the UTF-8 scheme for OS API calls and for calls to functions that
follow OS conventions.

* Native encoded strings avoid many bug traps such as passing a UTF-8
string to a function expecting ANSI, or vice versa.

* Native encoded strings work seamlessly with the largest amount of code
(Windows code and nix code), while the UTF-8 approach only works
seamlessly with nix-oriented code.

Conversely, points such as those above mean that the UTF-8 approach is
objectively much worse for portable code.

In particular, the UTF-8 approach violates the principle of not paying
for what you don't (need to or want to) use, by adding inefficient
conversions in all directions; it violates the principle of least
surprise (where did that gobbledygook come from); and it violates the
KISS principle ("Keep It Simple, Stupid!", forcing Windows programmers
to deal with 3 internal string encodings instead of just 2).
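
Where the gobbledygook comes from can be reproduced with a small sketch. It assumes, as a simplification, that the "ANSI" function treats each byte as one ISO-8859-1 (Latin-1) character; real Windows ANSI code pages differ in detail. The name `misreadAsLatin1` is hypothetical. The result is re-encoded as UTF-8 only so that it can be inspected.

```cpp
#include <cassert>
#include <string>

namespace u_sketch {
    // Treats every byte of a UTF-8 string as a separate Latin-1
    // character, which is what a byte-per-character function would see,
    // and returns that misreading encoded as UTF-8.
    inline std::string misreadAsLatin1( std::string const& utf8 )
    {
        std::string result;
        for( char const ch : utf8 )
        {
            unsigned const byte = static_cast<unsigned char>( ch );
            if( byte < 0x80 )
            {
                result += ch;       // ASCII is the same in both encodings.
            }
            else
            {
                // Latin-1 code points 0x80..0xFF take two UTF-8 bytes.
                result += static_cast<char>( 0xC0 | (byte >> 6) );
                result += static_cast<char>( 0x80 | (byte & 0x3F) );
            }
        }
        // Each input byte produced one or two output bytes.
        assert( result.size() >= utf8.size() );
        return result;
    }
}
```

Under that assumption the two UTF-8 bytes of "å" (0xC3 0xA5) are seen as the two characters "Ã" and "¥", while pure ASCII text passes through unchanged, which is why such bugs tend to stay hidden until non-English text shows up.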

>>> You judge from a non-portable code point-of-view. How about:
>>>
>>> #include<cstdio>
>>>
>>> #include "gtkext/message_box.h" // for gtkext::message_box
>>>
>>> int main()
>>> {
>>> char buffer[80];
>>> sprintf(buffer, "The answer is %d!", 6*7);
>>> gtkext::message_box(buffer, "This is a title!",
>>> gtkext::icon_blah_blah,
>>> ...);
>>> }
>>>
>>> And unlike your code, it's magically portable! (thanks to gtk using UTF-8
>>> on
>>> windows)
>>
>> Aha. When you use a library L that translates in platform-specific ways
>> to/from UTF-8 for you, then UTF-8 is magically portable. For use of L.
>>
>> However, try to pass a `main` argument over to gtkext::message_box.
>
> See the argv explanation in
> http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036

I'm sorry, I don't see what's relevant there. You suggest there that
boost::program_options can be used if it is fixed to support UTF-8;
quote "she can use boost::program_options (assuming it's also changed to
follow the UTF-8 convention)". I think that suggestion is probably
misguided. For as far as I can see boost::program_options does not provide
any way to obtain the undamaged command line in Windows (and anyway that
command line is UTF-16 encoded). Without a portable way to obtain
undamaged program arguments, portable support for parsing them with this
encoding or that encoding seems to me to be irrelevant.

Anyway, where does this introduction of special cases end?

At every point where UTF-8 does not work, the suggested solution is to
add an inefficient data conversion and support that on all platforms.

Cheers & hth.,

- Alf

