Subject: Re: [boost] Silly Boost.Locale default narrow string encoding in Windows
From: Alf P. Steinbach (alf.p.steinbach+usenet_at_[hidden])
Date: 2011-10-27 17:12:53
On 27.10.2011 21:07, Peter Dimov wrote:
> Alf P. Steinbach wrote:
>> On 27.10.2011 20:01, Peter Dimov wrote:
>> > File names on NTFS are not necessarily representable in the ANSI code
>> > page. A program that uses narrow strings in the ANSI code page to
>> > represent paths will not necessarily be able to open all files on the
>> > system.
>> Right, that's one reason why modern Windows programs should best be
>> wchar_t based.
> This is one of the two options. The other is using UTF-8 for
> representing paths as narrow strings. The first option is more natural
> for Windows-only code, and the second is better, in practice, for
> portable code because it avoids the need to duplicate all path-related
> functions for char/wchar_t. The motivation for using UTF-8 is practical,
> not political or religious.
Thanks for that clarification of the current thinking at Boost.
I suspected that people saw those two choices as an exhaustive
set of alternatives to choose from, but I wasn't sure.
Anyway, happily, the apparent forced choice between two inefficient
ungoods is not necessary -- i.e. it's a false dichotomy.
For, there are at least THREE options for representing paths and other
strings internally in the program, in portable single-source code:
1. wide character based (UTF-16 in Windows, possibly UTF-32 in *nix),
as you described above,
2. narrow character based (UTF-8), as you described above, and
3. the most natural sufficiently general native encoding, 1 or 2
depending on the platform that the source is being built for.
Option 3 means -- it requires, as far as I can see -- some
abstraction that hides the narrow/wide representation so as to get
source code level portability, which is all that matters for C++. It
doesn't need to involve very much. Some typedefs, traits, references.
Prior art in this direction includes Microsoft's [tchar.h].
For example, write a portable string literal like this:
PS( "This is a portable string literal" )
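A minimal sketch of how such a PS macro and its supporting typedefs could be defined, in the spirit of tchar.h. The names PChar, PString, PS_WIDEN and portable_greeting are hypothetical, chosen only for illustration, and testing _WIN32 is an assumed way of selecting the wide representation:

```cpp
#include <cassert>
#include <string>

#ifdef _WIN32
    // Windows: the native API is UTF-16 based, so use wide characters.
    typedef wchar_t PChar;
#   define PS_WIDEN( lit ) L ## lit
#   define PS( lit ) PS_WIDEN( lit )
#else
    // Assumed for *nix: the native API takes narrow (typically UTF-8) strings.
    typedef char PChar;
#   define PS( lit ) lit
#endif

typedef std::basic_string<PChar> PString;

// The same source line yields a native string on either platform.
inline PString portable_greeting()
{
    return PS( "This is a portable string literal" );
}

inline void check_portable_greeting()
{
    // 33 characters, whichever character type PChar happens to be.
    assert( portable_greeting().size() == 33 );
}
```

Client code then only ever mentions PChar, PString and PS, and never char or wchar_t directly.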
As compared to options 1 and 2, the benefits of option 3 include:
* no inefficient conversions except at the external boundary of the
program (and then in practice only in Windows, where it already happens),
* no problems with software and tools that don't understand a chosen
"universal" (option 1 or 2) encoding,
* no need to duplicate functions to adapt to underlying OS: one has
at hand exactly what the OS API wants.
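The last point can be sketched roughly as follows: a thin wrapper hands the native string straight to the platform's open call, with no encoding conversion and a single function for callers. PChar and open_for_reading are hypothetical names; _wfopen is the Microsoft runtime's wide-character counterpart of fopen, and it is assumed here that the Windows toolchain declares it via <wchar.h>:

```cpp
#include <cassert>
#include <cstdio>

#ifdef _WIN32
#   include <wchar.h>
    typedef wchar_t PChar;      // native Windows paths are wide
#else
    typedef char PChar;         // assumed: native *nix paths are narrow
#endif

// Opens a file for binary reading; returns a null pointer on failure.
// The path is passed to the runtime as-is, in the platform's native encoding.
inline std::FILE* open_for_reading( const PChar* path )
{
    assert( path != 0 );
#ifdef _WIN32
    return _wfopen( path, L"rb" );
#else
    return std::fopen( path, "rb" );
#endif
}
```

The platform split lives in one place, inside the wrapper, instead of being repeated as char and wchar_t overloads throughout the path-related code.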
The main drawback is IMO the need to use something like a PS macro for
string and character literals, or a C++11 /user defined literal/.
Windows programmers are used to that, writing _T("blah") all the time as
if Windows 95 was still extant. So, considering that all that current
labor is being done for no reward whatsoever, I think it should be no
problem convincing programmers that writing a few characters more in
order to get portable string literals is worth it; it just needs
exposure to examples from some authoritative source...
>> The example that I gave at top of the thread was passing a `main`
>> argument further on, when using Boost.Locale. It causes trouble
>> because in Windows `main` arguments are by convention encoded as ANSI,
>> while Boost.Locale has UTF-8 as default. Treating ANSI as UTF-8
>> generally yields gobbledygook, except for the pure ASCII common subset.
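To make the quoted point concrete, here is a small sketch of a structural UTF-8 check. It is a simplification written for illustration (it does not reject overlong forms or surrogates), and the names looks_like_utf8 and check_ansi_vs_utf8 are hypothetical. The byte 0xE9 is 'é' in Windows-1252, but in UTF-8 it announces a three-byte sequence, so ANSI text containing it fails the check while pure ASCII passes:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if every lead byte is followed by the right number of
// continuation bytes (10xxxxxx). Simplified structural check only.
inline bool looks_like_utf8( const std::string& s )
{
    std::size_t i = 0;
    while( i < s.size() )
    {
        const unsigned char lead = static_cast<unsigned char>( s[i] );
        std::size_t n_cont;
        if( lead < 0x80 )                   { n_cont = 0; }  // ASCII
        else if( (lead & 0xE0) == 0xC0 )    { n_cont = 1; }
        else if( (lead & 0xF0) == 0xE0 )    { n_cont = 2; }
        else if( (lead & 0xF8) == 0xF0 )    { n_cont = 3; }
        else                                { return false; } // stray or invalid byte

        if( n_cont > s.size() - i - 1 )     { return false; } // truncated sequence
        for( std::size_t k = 1; k <= n_cont; ++k )
        {
            const unsigned char c = static_cast<unsigned char>( s[i + k] );
            if( (c & 0xC0) != 0x80 )        { return false; }
        }
        i += n_cont + 1;
    }
    return true;
}

inline void check_ansi_vs_utf8()
{
    assert( looks_like_utf8( "blah" ) );            // pure ASCII: common subset
    assert( looks_like_utf8( "caf\xC3\xA9" ) );     // "café" encoded as UTF-8
    assert( !looks_like_utf8( "caf\xE9" ) );        // "café" in Windows-1252
}
```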
> Yes. If you (generic second person, not you specifically) want to take
> your paths from the narrow API, a UTF-8 default is not practical. But
> then again, you shouldn't take your paths from the narrow API, because
> it can't represent the names of all the files the user may have.
That's an unrelated issue, really, but I think Boost could use a "get
undamaged program arguments in portable strings" thing, if it isn't
Cheers & hth.,