
Subject: Re: [boost] [general] What will string handling in C++ look like in the future
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-21 19:59:52


On 01/21/2011 02:21 AM, Peter Dimov wrote:
> Beman Dawes wrote:
>
>> Why not just use Boost.Filesystem V3 for dealing with files and
>> filenames?
>
> The V3 path looks very reasonably designed and I can certainly
> understand why it's the way it is. However...
>
> Let's take Windows. In the vast majority of the use cases that call
> for construction from a narrow string, this string is either (a) ANSI
> code page encoded, (b) UTF-8 encoded. Of these, (b) are people doing
> the Right Thing, (a) are people doing the Wrong Thing or people who
> have to work with people doing the Wrong Thing (not that there's
> anything wrong with that).

There's nothing wrong with either one, although blindly dealing with a
string without knowing which encoding it uses would be problematic. Of
course this is an artificial distinction; it might just as well be
Shift-JIS or EUC or others. These are all valid, and only a subset of
the 8-bit encodings in wide use. In general, if someone works only
within a particular language and whatever they're using works for them,
they aren't much motivated to change. You won't easily convince them to
switch to UTF-8. Even though UTF-8 has the nice property of not being
state dependent like many other encodings, people long ago solved the
problems of dealing with whatever encoding is customary in their
region, and even though UTF-8 would be less problematic, their problems
are already solved. You're just asking them to take on a new set of
problems.

>
> v3::path has the following constructors:
>
> path( Source );
> path( Source, codecvt_type const & cvt );
>
> The first one uses std::codecvt<wchar_t, char, mbstate_t> to do the
> conversion, which "converts between the native character sets for
> narrow and wide characters" according to the standard. In other words,
> nobody knows for sure what it does without consulting the source of
> the STL implementation du jour, but one might expect it to use the C
> locale via mbtowc. This is a reasonable approximation of what we need
> (to convert between ANSI and wide) but pedants wouldn't consider it
> portable or reliable. It's also implicit - so it makes it easy for
> people to do the wrong thing.

That is a frustration. The program should check the locale when it runs
and use the current one; that locale carries the codecvt facet in it.
Now, many operating systems don't provide a standard locale in the
user's environment, so the default would be "C", which is 7-bit
US-ASCII. You can run this little program in your environment to see
what you get:

#include <iostream>
#include <locale>

int main()
{
     std::locale native("");   // the user's preferred locale, from the environment
     std::locale c("C");       // the minimal "C" locale
     std::locale global;       // a copy of the current global locale

     std::cout << "native : " << native.name() << '\n';
     std::cout << "classic: " << std::locale::classic().name() << '\n';
     std::cout << "global : " << global.name() << '\n';
     std::cout << "c      : " << c.name() << '\n';
     return 0;
}

Also, even where the standard specifies what will happen, vc++ doesn't
always follow it exactly, sometimes simply because the standard leaves
room for interpretation and different operating systems interpret the
same part of the spec differently.

>
>
> The second one allows me to use an arbitrary encoding, which is good
> in that I could pass it an utf8_codecvt or ansi_codecvt, if I find
> some buggy versions on the Web or write them myself. But, since it
> considers all encodings equally valid, it makes it hard for people to
> do the right thing.

Writing a code conversion facet yourself isn't hard, but it's tricky to
make sure all the corner cases work. It will be nicer with c++0x,
because it comes with standard code conversion facets that are
specified clearly enough that you can rely on them doing the same thing
across operating systems. The truth is that there is a dearth of
high-quality code conversion facets available as open source. Let's all
fix that. :) One of the problems has been different interpretations of
what a wchar_t is. That's another thing c++0x gets right with char32_t
and char16_t: no more converting from UTF-8 to wchar_t and getting
UTF-16 on one operating system and UTF-32 on another.

Patrick


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk