

Subject: Re: [boost] Silly Boost.Locale default narrow string encoding in Windows
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2011-10-30 14:54:24


On Fri, Oct 28, 2011 at 15:34, Alf P. Steinbach <
alf.p.steinbach+usenet_at_[hidden]> wrote:

> [...]
> There was a claim that the UTF-8 based code should just work,

I can't recall anyone saying this. What people were saying is that it's the
most sane way to write portable code. And if the vendors hadn't been
resisting UTF-8 adoption, it would just work.

> [...]
> * re-implementing
> e.g. the standard library to support UTF-8 (like boost::printf, and
> although I haven't tested the claim that it works for the program we
> discussed, it is enough for me that it /could/ work), or
>
> * wrapping
> it with some constant time data conversions (e.g. u::printf).
>
> The hello world program demonstrated that one or the other is necessary.
>

My last mail demonstrated that we need neither on Windows: printf
just works.
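
For concreteness, here's a minimal sketch of what "just works" means;
the SetConsoleOutputCP call is my assumption about the console setup,
and it isn't needed when output is redirected to a file:

    #include <cstdio>
    #ifdef _WIN32
    #include <windows.h>
    #endif

    int main()
    {
    #ifdef _WIN32
        // Assumption: tell the console to interpret narrow output as UTF-8.
        SetConsoleOutputCP( CP_UTF8 );
    #endif
        // The literal below is the UTF-8 encoding of "héllo, world".
        std::printf( "h\xC3\xA9llo, world\n" );
        return 0;
    }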

> So, we can forget the earlier silly claim that UTF-8 just magically works,
> and now really compare, for the simplest relevant program.
>

Now we can recall that claim and set it against your silly claim that
wrapping everything is easier.

[...]
> For a UTF-16 platform, a printf wrapper can simply be like this:
>
> inline int printf( CodingValue const* format, ... )
> {
>     va_list args;
>     va_start( args, format );
>     int const result = ::vwprintf( format->rawPtr(), args );
>     va_end( args );    // every va_start needs a matching va_end
>     return result;
> }
>

Apparently we don't need it. In the Linux world, asking the user to use
UTF-8 is legitimate; it's already the default almost everywhere. On some
non-Linux systems UTF-8 is the default too (Mac OS X?). On Windows we can
use narrow printf just fine.

> The sprintf wrapper that I used in my example is more interesting, though:
>
> inline int sprintf( CodingValue* buffer, size_t count,
>                     CodingValue const* format, ... )
> {
>     va_list args;
>     va_start( args, format );
>     int const result =
>         ::vswprintf( buffer->rawPtr(), count, format->rawPtr(), args );
>     va_end( args );
>     return result;
> }
>
> inline int sprintf( CodingValue* buffer, CodingValue const* format, ... )
> {
>     va_list args;
>     va_start( args, format );
>     int const result =
>         ::vswprintf( buffer->rawPtr(), size_t( -1 ), format->rawPtr(), args );
>     va_end( args );
>     return result;
> }
>

Oh, thank you! You suggest wrapping each such function, which comes in two
kinds... With the UTF-8 approach you don't need to wrap or re-implement
sprintf at all. The whole point of UTF-8 is that it already works with most
of the existing narrow library functions (strlen, strstr, str*,
std::string, etc.). Simpler, isn't it?
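
A minimal sketch of that point (my example; the byte values are simply the
UTF-8 encoding of "naïve"):

    #include <cstring>
    #include <string>

    int main()
    {
        // "naïve" as UTF-8 in an ordinary narrow string: 'ï' is C3 AF.
        std::string s = "na\xC3\xAFve";
        std::size_t bytes = std::strlen( s.c_str() );          // 6 bytes, not 5 characters
        char const* p = std::strstr( s.c_str(), "\xC3\xAF" );  // substring search works,
            // because UTF-8 is self-synchronizing: one character's code unit
            // sequence can never appear inside another character's sequence.
        s += " text";                                          // concatenation is encoding-agnostic
        return ( bytes == 6 && p != 0 ) ? 0 : 1;
    }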

> The problem that the above solves is that standard vswprintf is not a
> simple wchar_t version of standard vsprintf. As I recall, Microsoft's
> [tchar.h] relies on a compiler-specific overload, but that approach does
> not cut it for platform-independent code. For wchar_t/char independent
> code, one solution (as above) is to offer both signatures.
>

No such problems in the UTF-8 world.

>> but anyway you have to do O(N) work to wrap the N library functions you
>> use.
>>
>
> Not quite.
>
> It is so for the UTF-8 scheme for platform independent things such as
> standard library i/o, and it is so also for the native string scheme for
> platform independent things such as standard library i/o.
>

As we see, it's the other way around...

> But when you're talking about the OS API, then with the UTF-8 scheme you
> need inefficient string data conversions

It's quite efficient. In fact, it has never been a bottleneck: invoking the
OS usually triggers complex operations anyway. Moreover, even in the
non-English-speaking world, most of the text internal to programs is still
ASCII. UTF-8 saves space and cache usage, which compensates for the
conversion penalty. To make definite statements you must measure; otherwise
it's premature optimization, if it's an optimization at all.

Also note that in a multi-threaded world with hierarchical memory,
computation is becoming cheaper than memory access.

> and N wrappers, while with the native string scheme no string data
> conversions and no wrappers are needed.

The difference is what you wrap: the standard interface or the proprietary
OS interface. We benefit more from wrapping the latter, as has been done
hundreds of times in every portable library that tries to accomplish
anything beyond primitive file I/O. This is because you get a portable
library as a by-product.
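
To illustrate what wrapping the OS interface looks like, here is a
hypothetical sketch (the name u8_fopen and its shape are mine, not from any
existing library): the UTF-8 path is converted once, right at the boundary,
and everything above it keeps passing plain narrow strings around.

    #include <cstdio>
    #include <string>
    #ifdef _WIN32
    #include <windows.h>
    #endif

    // Hypothetical wrapper: UTF-8 in, conversion (if any) done once
    // at the OS boundary.
    inline std::FILE* u8_fopen( char const* path, char const* mode )
    {
    #ifdef _WIN32
        // UTF-8 -> UTF-16, only because the Win32 API wants it.
        int const n = MultiByteToWideChar( CP_UTF8, 0, path, -1, 0, 0 );
        std::wstring wpath( n, L'\0' );
        MultiByteToWideChar( CP_UTF8, 0, path, -1, &wpath[0], n );
        wchar_t wmode[8] = {};
        MultiByteToWideChar( CP_UTF8, 0, mode, -1, wmode, 7 );  // mode is plain ASCII
        return ::_wfopen( wpath.c_str(), wmode );
    #else
        return std::fopen( path, mode );  // the narrow call already takes UTF-8
    #endif
    }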

>> Your approach is no way better.
>
> I hope to convince you that the native string approach is objectively
> better for portable code, for any reasonable criteria, e.g.:
>
>
> * Native encoded strings avoid the inefficient string data conversions of
> the UTF-8 scheme for OS API calls and for calls to functions that follow OS
> conventions.
>

Stop calling it inefficient. If you store portable data on some storage, or
receive it over the network (as any serious application does today), you
can't avoid conversions. You just have to decide where to do them: closer
to the OS or further from it. Anyway, see above.

> * Native encoded strings avoid many bug traps such as passing a UTF-8
> string to a function expecting ANSI, or vice versa.
>

Yeah, and "multiple inheritance causes multiple abuse of multiple
inheritance"[1] microsoft said? UTF-8 avoids many bug traps such as
forgetting that UTF-16 is actually a variable length encoding. EVERYBODY
knows that UTF-8 has vaaariiable-lng codepoints.
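
A minimal sketch of that trap, using C++11 string literals for brevity (my
example, not from the thread):

    #include <string>

    int main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF is outside the BMP: in UTF-16 it is
        // a surrogate pair, so "one character" != "one code unit" there either.
        std::u16string clef16 = u"\U0001D11E";       // 2 UTF-16 code units
        std::string    clef8  = "\xF0\x9D\x84\x9E";  // same character, 4 UTF-8 bytes
        return ( clef16.size() == 2 && clef8.size() == 4 ) ? 0 : 1;
    }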

> * Native encoded strings work seamlessly with the largest amount of code
> (Windows code and nix code), while the UTF-8 approach only works seamlessly
> with nix-oriented code.
>

Hmmm... I prefer the latter, just to avoid all the boilerplate wrappers for
what has been standard for years. And I'm a Windows programmer. Besides,
how would you return Unicode from std::exception::what() if not as UTF-8?

> Conversely, points such as those above mean that the UTF-8 approach is
> objectively much worse for portable code.
>

Since I'm tired of repeating the same points again and again, see "Using
the native encoding" in
http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036

> In particular, the UTF-8 approach violates the principle of not paying for
> what you don't (need to or want to) use

UTF-16 violates the principle of "you don't pay for what you don't use": if
most of your text is ASCII (which is true for program-internal text even in
non-English countries), you don't want to waste twice as much memory.
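
The arithmetic is plain (C++11 literals again; my example):

    #include <string>

    int main()
    {
        // Typical program-internal text is ASCII-only:
        std::string    a =  "configuration/settings.xml";  // 26 bytes as UTF-8
        std::u16string b = u"configuration/settings.xml";  // 26 code units = 52 bytes as UTF-16
        return ( b.size() * sizeof( char16_t ) == 2 * a.size() ) ? 0 : 1;
    }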

> , by adding inefficient conversions in all directions;

Again? seekg(0) and read(). You'll have to do conversions anyway, e.g. when
you read from a file. You don't store the native encoding in a portable
file, do you?

> [...] and it violates the KISS principle ("Keep It Simple, Stupid!"),
> forcing Windows programmers to deal with 3 internal string encodings
> instead of just 2.

If you're working with 2 encodings, you're doing something terribly wrong.
Seriously, it looks like you're still living in the 20th century. You shall
not use ANSI encodings (other than UTF-8) on Windows, because they don't
work with Unicode; they are mostly deprecated. Microsoft encourages you to
use either UTF-8 or UTF-16 (
http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%29.aspx
).

Now, assuming you have stopped using legacy 'ANSI' encodings, you're left
with only UTF-16 (internal) and UTF-8 (external). Replace the internal
UTF-16 with UTF-8, and you're left with only ONE encoding used for
EVERYTHING, internal and external. UTF-16 at OS calls doesn't count, as
it's not stored anywhere (you're not 'dealing' with it).

[1] From some C# book by Microsoft that I glanced at a few years ago.

-- 
Yakov
