Subject: Re: [boost] Silly Boost.Locale default narrow string encoding in Windows
From: Alf P. Steinbach (alf.p.steinbach+usenet_at_[hidden])
Date: 2011-11-01 11:03:23


On 31.10.2011 18:18, bugpower wrote:
> Alf, All,
>
> What replies seem to be missing here is that what you call the "least
> surprise" behavior of the code with argument of main(), is simply incorrect
> from the software engineering point of view. Let me explain:
>
>> 3. the most natural sufficiently general native encoding, 1 or 2
>> depending on the platform that the source is being built for.
>
> Now, when accepting filename from the user's command line on Windows, it is
> simply not possible to use narrow-string version of main().

Well, there are three aspects of that claim:

   1 The limitations of `main` in Windows.

Regarding aspect (1), the C++ Standard does describe the `main`
arguments as null-terminated multibyte strings ("NTMBS"), meaning they
can (should) be encoded with possibly two or more bytes per character,
as in for example UTF-8, which to me is strongly implied. However, the
Windows convention for the C++ executable narrow character set predates
the C++ standard by a long shot, and even predates the C standard, and
is Windows ANSI. And that convention is /very/ deeply embedded, not only
in the runtime library implementations but e.g. in how Visual C++
translates string literals.

   2 What you're trying to communicate.

Regarding aspect (2), quoting more of the concrete context would make it
clearer to readers what you're trying to say.

I'm not a telepath. But it does sound like you're arguing against a
straw man of your own devising. As if someone had argued for using
ANSI-encoded arguments in general, or as some solution of i18n.

So, I will put my initial remark above, more strongly:

Please always /quote/ what you're referring to.

Especially when you are offering something that sounds like an argument
against something, then please /quote/ what you're referring to.

   3 The literal claim that "it is simply not possible to use narrow-
     string version of main()".

Regarding aspect (3), this claim is incorrect.

However, many people think that one has to use a non-standard startup
function like WinMain, that one has to ditch some parts of the C++
standard as soon as one does Windows, so, some basic technical facts:

The GNU toolchain (the g++ compiler) happily accepts a standard `main`
startup function without further ado, regardless of Windows subsystem.

The Microsoft toolchain (the Visual C++ compiler and linker), however,
is less adept at recognizing your startup function as such. So with the
MS toolchain you have to specify the startup function explicitly if
you're building a GUI subsystem program and want a standard `main`. The
relevant linker options: "/entry:mainCRTStartup /subsystem:windows".

   ---

Finally, note how I had to cover a lot of bases and use a lot of time on
responding to your single little sentence.

That's because that sentence was very *unclear* and *misleading*, and,
given the next sentence, quoted below, I hope it was not so by design.

> Your code cannot
> enforce your user to limit his input to characters representable in the
> current ANSI codepage.

Ignoring the misleading "your", and responding to the technical content
only:

the previous sentence talked about `main` arguments, and this following
sentence talks about "input", so it seems that you are confusing two
different aspects that have very different behaviors in Windows.

In Windows the program arguments are always passed to the process as a
single UTF-16 encoded command line string, available via the API
function GetCommandLine.

For the command line it is therefore meaningless to talk about
restricting the user.
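
A minimal sketch of getting at those original UTF-16 arguments, assuming
the Windows API functions GetCommandLineW and CommandLineToArgvW (the
latter lives in shell32). The name `native_args` is made up for this
illustration, and when not built for Windows the sketch simply returns
an empty list:

```cpp
#include <string>
#include <vector>

#ifdef _WIN32
#   include <windows.h>
#   include <shellapi.h>    // CommandLineToArgvW; link with shell32.
#endif

// Hypothetical helper: the program arguments as the OS delivered them,
// i.e. without any conversion to the ANSI codepage.
std::vector<std::wstring> native_args()
{
    std::vector<std::wstring> result;
#ifdef _WIN32
    int n = 0;
    // Splits the process' UTF-16 command line into argv-like parts.
    wchar_t** const parts = CommandLineToArgvW( GetCommandLineW(), &n );
    if( parts != nullptr )
    {
        result.assign( parts, parts + n );
        LocalFree( parts );     // The parts are one LocalAlloc'ed block.
    }
#endif
    return result;              // Empty when not built for Windows.
}
```

A standard narrow `main` can call `native_args()` and ignore its own
`argv`, so the choice of startup function does not limit which
characters the program can receive.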

Standard /input/, OTOH, is always passed via some narrow character
encoding, which does not include UTF-16, and which by convention is
neither ANSI nor UTF-16 but the extraordinarily impractical OEM codepage
(on an English PC that codepage is the original IBM PC char set). Happily
it is possible to change the narrow character encoding used for input.
For example, it can be changed to UTF-8 as the external encoding. This
is called the "active codepage" in a console window, and it can also be
changed by the user, e.g. via the commands 'mode' and 'chcp'.

The dangers of selecting UTF-8 as the active codepage in a command
interpreter console window have been discussed else-thread; in short,
Microsoft has a large number of ridiculous bugs in their support.

But that discussion also showed that it's (very probably) OK under
program control.

> If the command line parameter is a filename as in the
> example you suggested, you cannot tell them "never double click on some
> files" (if a program used in a file association).

What example?

Please always quote what you refer to, and quote enough of that context:
don't be ambiguous, don't leave it to readers to infer a context based
on your possibly wrong understanding of it.

Anyway, text passed as `main` arguments can be e.g. the user's name,
which is not necessarily a filename.

> Supporting is always
> better than white-listing, so the only acceptable way of using command line
> parameter which is a filename on windows is with UTF-16 version - _tmain().

Oh dear.

Are you seriously suggesting using `_tmain` to keep compatible with
Windows 9x? Note that for Windows 9x, `_tmain` maps to standard `main`.
And note that in Windows, `main` has Windows ANSI-encoded arguments...

`_tmain` is a Microsoft macro that helped support compatibility with
Windows 9x before the Layer for Unicode was introduced in 2001.

`_tmain` maps to narrow character standard `main` or wide character
non-standard `wmain` depending on the `_UNICODE` macro symbol.
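
The mechanism can be sketched with stand-in names. Everything prefixed
`DEMO_` or `Demo_` below is hypothetical and is not Microsoft's actual
<tchar.h>; `DEMO_UNICODE` plays the role of `_UNICODE`:

```cpp
#include <string>

// Hypothetical stand-ins for the <tchar.h> machinery.
#ifdef DEMO_UNICODE
    using Demo_tchar = wchar_t;
#   define DEMO_T( s )      L##s
#   define DEMO_ENTRY_NAME  "wmain"
#else
    using Demo_tchar = char;
#   define DEMO_T( s )      s
#   define DEMO_ENTRY_NAME  "main"
#endif

using Demo_tstring = std::basic_string<Demo_tchar>;

// The same source line yields a char or a wchar_t string, per the macro.
inline Demo_tstring greeting() { return DEMO_T( "Hello" ); }

// Name of the startup function that a _tmain-like macro would select.
inline char const* entry_name() { return DEMO_ENTRY_NAME; }
```

Without `DEMO_UNICODE` defined, every "generic" piece of text here is
plain `char` text, which is the point made above about `_tmain` mapping
to the ANSI-encoded `main`.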

`_tmain` was an abomination even in its day, and today there are no
reasons whatsoever to obfuscate the code that way.

> Then, proceed as Artyom explained. The surprise is then justified - it
> prevented a hard-to-spot bug. My preference on Windows though would be
> different (and not due to religious reasons) - convert all API-returned
> strings to UTF-8 as soon as possible and forget about encoding issues for
> good.

No, it does not let you forget about encoding issues.

Rather it introduces extra bug attractors, since you have then
overloaded the meaning of char-based text. By convention in Windows and
in most existing Windows code, char is ANSI-encoded, so other code will
expect ANSI encoding from the UTF-8 based code, which will tend to
introduce bugs. And other code will produce ANSI encoded text to the
UTF-8 based code, which will tend to introduce bugs. Thus adding another
possible encoding is absolutely not a good idea wrt. bugs.
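
A small sketch of that kind of bug. The helper name `as_if_ansi` is made
up for this illustration, and it assumes codepage 1252, treating each
byte as a Latin-1 character (the two agree outside the range
0x80..0x9F). It shows what ANSI-expecting code makes of UTF-8 bytes:

```cpp
#include <string>

// Hypothetical helper: reads each byte as one Latin-1 character, the way
// ANSI-expecting code would, and re-encodes the result as UTF-8 so that
// the misreading can be displayed.
std::string as_if_ansi( std::string const& bytes )
{
    std::string result;
    for( char const ch : bytes )
    {
        unsigned const code = static_cast<unsigned char>( ch );
        if( code < 0x80 )
        {
            result += ch;       // ASCII is the same in both encodings.
        }
        else
        {
            // Two-byte UTF-8 sequence for a code point in 0x80..0xFF.
            result += static_cast<char>( 0xC0 | (code >> 6) );
            result += static_cast<char>( 0x80 | (code & 0x3F) );
        }
    }
    return result;
}
```

The UTF-8 encoding of "é" is the two bytes C3 A9. Passed through
`as_if_ansi` they come out as the two characters "Ã©", while pure ASCII
text passes unchanged, which is why such bugs tend to survive testing
with English-only data.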

And you're adding inefficiency for all the myriad internal conversions.

And you're adding either an utterly silly restriction to English A-Z in
literals, or, for the let's-lie-to-the-compiler UTF-8 source without BOM
served to the Visual C++ compiler, a requirement that the wide character
literals language feature is not used, plus hoping for the best with
respect to how smart later versions of the compiler will be.

And most Windows libraries abide by Windows conventions, so it means
extra work for supporting most library code. I.e. O(n) work for writing
inefficient data converting wrappers for an unbounded set of functions
instead of just O(1) work for writing efficient pointer type converting
wrappers for a fixed set of functions. Think about it.

As far as I know there is not *one single technical problem* that the
all-UTF-8 scheme solves. I.e., AFAIK from a purely technical POV it's dumb.

> See
> http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful

Hm, was that an associative reference?

Let me quote from the question:

<quote>
For example, try to create file names in Windows that include these
characters; try to delete these characters with a "backspace" to see how
they behave in different applications that use UTF-16. I did some tests
and the results are quite bad:

  * Opera has problem with editing them (delete required 2 presses on
    backspace)
  * Notepad can't deal with them correctly (delete required 2 presses
    on backspace)
  * File names editing in Window dialogs in broken (delete required 2
    presses on backspace)
  * All QT3 applications can't deal with them - show two empty squares
    instead of one symbol.
  * Python encodes such characters incorrectly when used directly
    u'X'!=unicode('X','utf-16') on some platforms when X in character
    outside of BMP.
  * Python 2.5 unicodedata fails to get properties on such characters
    when python compiled with UTF-16 Unicode strings.
  * StackOverflow seems to remove these characters from the text if
    edited directly in as Unicode characters (these characters are shown
    using HTML Unicode escapes).
  * WinForms TextBox may generate invalid string when limited with
    MaxLength.
</quote>

Here the poster lists concrete examples of how many common applications
already have bugs in their Unicode handling.

Showing by example that Unicode is tricky to get right.
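
The "2 presses on backspace" symptom comes from treating one UTF-16 code
unit as one character. A minimal sketch of the difference, where the
helper names are made up for this illustration and only standard C++11
is assumed:

```cpp
#include <string>

// A code unit in 0xD800..0xDBFF is the first half of a surrogate pair.
inline bool is_high_surrogate( char16_t const unit )
{
    return 0xD800 <= unit && unit <= 0xDBFF;
}

// A code unit in 0xDC00..0xDFFF is the second half.
inline bool is_low_surrogate( char16_t const unit )
{
    return 0xDC00 <= unit && unit <= 0xDFFF;
}

// Hypothetical helper: what a correct "backspace" does, namely removing
// the last code point, which is 1 or 2 code units.
void erase_last_code_point( std::u16string& s )
{
    if( s.empty() ) { return; }
    bool const ends_in_pair =
        s.size() >= 2 &&
        is_low_surrogate( s[s.size() - 1] ) &&
        is_high_surrogate( s[s.size() - 2] );
    s.erase( s.size() - (ends_in_pair? 2 : 1) );
}
```

With `std::u16string s = u"a\U0001D11E";` the string has three code
units, since U+1D11E lies outside the BMP. A single
`erase_last_code_point( s )` leaves `u"a"`, whereas a naive
`s.pop_back()` would leave a lone high surrogate behind, i.e. an invalid
string.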

Is it then a good idea to needlessly, and at great cost, add further
confusion about whether narrow characters are encoded as ANSI or UTF-8?

Cheers & hth.,

- Alf

