

Subject: Re: [boost] Silly Boost.Locale default narrow string encoding in Windows
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2011-10-30 14:54:24


On Fri, Oct 28, 2011 at 15:34, Alf P. Steinbach <
alf.p.steinbach+usenet_at_[hidden]> wrote:

> [...]
> There was a claim that the UTF-8 based code should just work,

I can't recall anyone saying this. What people were saying is that it's the
most sane way to write portable code. And if the vendors hadn't been
resisting UTF-8 adoption, it would just work.

> [...]
> * re-implementing
> e.g. the standard library to support UTF-8 (like boost::printf, and
> although I haven't tested the claim that it works for the program we
> discussed, it is enough for me that it /could/ work), or
>
> * wrapping
> it with some constant time data conversions (e.g. u::printf).
>
> The hello world program demonstrated that one or the other is necessary.
>

My last mail demonstrated that we need neither on Windows: printf
just works.
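
For concreteness, here's a minimal sketch of what "just works" means;
the SetConsoleOutputCP call is my assumption about the console setup,
and it isn't needed when output is redirected to a file:

    #include <cstdio>
    #ifdef _WIN32
    #include <windows.h>
    #endif

    int main()
    {
    #ifdef _WIN32
        // Assumption: tell the console to interpret narrow output as UTF-8.
        SetConsoleOutputCP( CP_UTF8 );
    #endif
        // The literal below is the UTF-8 encoding of "héllo, world".
        std::printf( "h\xC3\xA9llo, world\n" );
        return 0;
    }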

> So, we can forget the earlier silly claim that UTF-8 just magically works,
> and now really compare, for the simplest relevant program.
>

Now we can recall that claim and set it against your silly claim that
wrapping everything is easier.

[...]
> For a UTF-16 platform, a printf wrapper can simply be like this:
>
> inline int printf( CodingValue const* format, ... )
> {
>     va_list args;
>     va_start( args, format );
>     int const result = ::vwprintf( format->rawPtr(), args );
>     va_end( args );    // every va_start needs a matching va_end
>     return result;
> }
>

Apparently we don't need it. In the Linux world, asking the user to use
UTF-8 is legitimate; it's already the default almost everywhere. On some
non-Linux systems UTF-8 is the default too (Mac OS X?). On Windows we can
use narrow printf just fine.

> The sprintf wrapper that I used in my example is more interesting, though:
>
> inline int sprintf( CodingValue* buffer, size_t count,
>                     CodingValue const* format, ... )
> {
>     va_list args;
>     va_start( args, format );
>     int const result =
>         ::vswprintf( buffer->rawPtr(), count, format->rawPtr(), args );
>     va_end( args );
>     return result;
> }
>
> inline int sprintf( CodingValue* buffer, CodingValue const* format, ... )
> {
>     va_list args;
>     va_start( args, format );
>     int const result =
>         ::vswprintf( buffer->rawPtr(), size_t( -1 ), format->rawPtr(), args );
>     va_end( args );
>     return result;
> }
>

Oh, thank you! You suggest wrapping each such function, which comes in two
kinds... With the UTF-8 approach you don't need to wrap or re-implement
sprintf at all. The whole point of UTF-8 is that it already works with most
of the existing narrow library functions (strlen, strstr, str*,
std::string, etc.). Simpler, isn't it?
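
A minimal sketch of that point (my example; the byte values are simply the
UTF-8 encoding of "naïve"):

    #include <cstring>
    #include <string>

    int main()
    {
        // "naïve" as UTF-8 in an ordinary narrow string: 'ï' is C3 AF.
        std::string s = "na\xC3\xAFve";
        std::size_t bytes = std::strlen( s.c_str() );          // 6 bytes, not 5 characters
        char const* p = std::strstr( s.c_str(), "\xC3\xAF" );  // substring search works,
            // because UTF-8 is self-synchronizing: one character's code unit
            // sequence can never appear inside another character's sequence.
        s += " text";                                          // concatenation is encoding-agnostic
        return ( bytes == 6 && p != 0 ) ? 0 : 1;
    }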

> The problem that the above solves is that standard vswprintf is not a
> simple wchar_t version of standard vsprintf. As I recall, Microsoft's
> [tchar.h] relies on a compiler-specific overload, but that approach does
> not cut it for platform-independent code. For wchar_t/char independent
> code, one solution (as above) is to offer both signatures.
>

No such problems in the UTF-8 world.

>> but anyway you have to do O(N) work to wrap the N library functions you
>> use.
>>
>
> Not quite.
>
> It is so for the UTF-8 scheme for platform independent things such as
> standard library i/o, and it is so also for the native string scheme for
> platform independent things such as standard library i/o.
>

As we see, it's the other way around...

> But when you're talking about the OS API, then with the UTF-8 scheme you
> need inefficient string data conversions

It's quite efficient. In fact, it has never been a bottleneck: invoking the
OS usually triggers complex operations anyway. Moreover, even in the
non-English-speaking world, most of the text internal to programs is still
ASCII. UTF-8 saves space and cache usage, which compensates for the
conversion penalty. To make definite statements you must measure; otherwise
it's premature optimization, if it's an optimization at all.

Also note that in a multi-threaded world with hierarchical memory,
computation is becoming cheaper than memory access.

> and N wrappers, while with the native string scheme no string data
> conversions and no wrappers are needed.

The difference is what you wrap: the standard interface or the proprietary
OS interface. We benefit more from wrapping the latter, as has been done
hundreds of times in every portable library that tries to accomplish
anything beyond primitive file I/O. This is because you get a portable
library as a by-product.
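
To illustrate what wrapping the OS interface looks like, here is a
hypothetical sketch (the name u8_fopen and its shape are mine, not from any
existing library): the UTF-8 path is converted once, right at the boundary,
and everything above it keeps passing plain narrow strings around.

    #include <cstdio>
    #include <string>
    #ifdef _WIN32
    #include <windows.h>
    #endif

    // Hypothetical wrapper: UTF-8 in, conversion (if any) done once
    // at the OS boundary.
    inline std::FILE* u8_fopen( char const* path, char const* mode )
    {
    #ifdef _WIN32
        // UTF-8 -> UTF-16, only because the Win32 API wants it.
        int const n = MultiByteToWideChar( CP_UTF8, 0, path, -1, 0, 0 );
        std::wstring wpath( n, L'\0' );
        MultiByteToWideChar( CP_UTF8, 0, path, -1, &wpath[0], n );
        wchar_t wmode[8] = {};
        MultiByteToWideChar( CP_UTF8, 0, mode, -1, wmode, 7 );  // mode is plain ASCII
        return ::_wfopen( wpath.c_str(), wmode );
    #else
        return std::fopen( path, mode );  // the narrow call already takes UTF-8
    #endif
    }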

>> Your approach is no way better.
>
> I hope to convince you that the native string approach is objectively
> better for portable code, for any reasonable criteria, e.g.:
>
>
> * Native encoded strings avoid the inefficient string data conversions of
> the UTF-8 scheme for OS API calls and for calls to functions that follow OS
> conventions.
>

Stop calling it inefficient. If you store portable data on some storage, or
receive it over the network (as any serious application does today), you
can't avoid conversions. You just have to decide where to do them: closer
to the OS or further from it. Anyway, see above.

> * Native encoded strings avoid many bug traps such as passing a UTF-8
> string to a function expecting ANSI, or vice versa.
>

Yeah, and "multiple inheritance causes multiple abuse of multiple
inheritance"[1] microsoft said? UTF-8 avoids many bug traps such as
forgetting that UTF-16 is actually a variable length encoding. EVERYBODY
knows that UTF-8 has vaaariiable-lng codepoints.
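
A minimal sketch of that trap, using C++11 string literals for brevity (my
example, not from the thread):

    #include <string>

    int main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF is outside the BMP: in UTF-16 it is
        // a surrogate pair, so "one character" != "one code unit" there either.
        std::u16string clef16 = u"\U0001D11E";       // 2 UTF-16 code units
        std::string    clef8  = "\xF0\x9D\x84\x9E";  // same character, 4 UTF-8 bytes
        return ( clef16.size() == 2 && clef8.size() == 4 ) ? 0 : 1;
    }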

> * Native encoded strings work seamlessly with the largest amount of code
> (Windows code and nix code), while the UTF-8 approach only works seamlessly
> with nix-oriented code.
>

Hmmm... I prefer the latter, just to avoid all the boilerplate wrappers for
what has been standard for years. And I'm a Windows programmer. Besides,
how would you return Unicode from std::exception::what() if not as UTF-8?

> Conversely, points such as those above mean that the UTF-8 approach is
> objectively much worse for portable code.
>

Since I'm tired of repeating the same points again and again, see "Using
the native encoding" in
http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036

> In particular, the UTF-8 approach violates the principle of not paying for
> what you don't (need to or want to) use

UTF-16 violates the principle of "you don't pay for what you don't use": if
most of your text is ASCII (which is true for program-internal text even in
non-English countries), you don't want to waste twice as much memory.
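
The arithmetic is plain (C++11 literals again; my example):

    #include <string>

    int main()
    {
        // Typical program-internal text is ASCII-only:
        std::string    a =  "configuration/settings.xml";  // 26 bytes as UTF-8
        std::u16string b = u"configuration/settings.xml";  // 26 code units = 52 bytes as UTF-16
        return ( b.size() * sizeof( char16_t ) == 2 * a.size() ) ? 0 : 1;
    }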

> , by adding inefficient conversions in all directions;

Again? seekg(0) and read(). You'll have to do conversions anyway, e.g. when
you read from a file. You don't store the native encoding in a portable
file, do you?

> [...] and it violates the KISS principle ("Keep It Simple, Stupid!"),
> forcing Windows programmers to deal with 3 internal string encodings
> instead of just 2.

If you're working with 2 encodings, you're doing something terribly wrong.
Seriously, it looks like you're still living in the 20th century. You shall
not use ANSI encodings (other than UTF-8) on Windows, because they don't
work with Unicode; they are mostly deprecated. Microsoft encourages you to
use either UTF-8 or UTF-16 (
http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%29.aspx
).

Now, assuming you have stopped using legacy 'ANSI' encodings, you're left
with only UTF-16 (internal) and UTF-8 (external). Replace the internal
UTF-16 with UTF-8, and you're left with only ONE encoding used for
EVERYTHING, internal and external. UTF-16 at OS calls doesn't count, as
it's not stored anywhere (you're not 'dealing' with it).

[1] From some C# book by Microsoft that I glanced at a few years ago.

-- 
Yakov
