Subject: Re: [boost] [general] What will string handling in C++ look like in the future
From: Beman Dawes (bdawes_at_[hidden])
Date: 2011-01-21 08:55:46


On Fri, Jan 21, 2011 at 5:21 AM, Peter Dimov <pdimov_at_[hidden]> wrote:

> Beman Dawes wrote:
>
>> Why not just use Boost.Filesystem V3 for dealing with files and filenames?
>
> The V3 path looks very reasonably designed and I can certainly understand
> why it's the way it is. However...
>
> Let's take Windows. In the vast majority of the use cases that call for
> construction from a narrow string, this string is either (a) ANSI code page
> encoded, (b) UTF-8 encoded. Of these, (b) are people doing the Right Thing,
> (a) are people doing the Wrong Thing or people who have to work with people
> doing the Wrong Thing (not that there's anything wrong with that).
>

Sure, but anything other than that would be untenable. Programmers will
assume that the default is for a narrow string to be treated exactly the way
it would be treated in a call to the C library's fopen(), and doing
something different would cause endless real-world bugs.

> v3::path has the following constructors:
>
> path( Source );
> path( Source, codecvt_type const & cvt );
>
> The first one uses std::codecvt<wchar_t, char, mbstate_t> to do the
> conversion, which "converts between the native character sets for narrow and
> wide characters" according to the standard. In other words, nobody knows for
> sure what it does without consulting the source of the STL implementation du
> jour, but one might expect it to use the C locale via mbtowc. This is a
> reasonable approximation of what we need (to convert between ANSI and wide)
> but pedants wouldn't consider it portable or reliable. It's also implicit -
> so it makes it easy for people to do the wrong thing.
>

std::codecvt<wchar_t, char, mbstate_t> is the type, but on Windows the
actual object used is a custom codecvt that calls Windows
MultiByteToWideChar() with the ANSI or OEM code page, as determined by
AreFileApisANSI(). Your point is correct, but only if you believe that
defaulting to the platform's usual open()/fopen() behavior is the wrong
thing.
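
For anyone unfamiliar with how a codecvt facet performs the narrow-to-wide
step, here is a minimal, portable sketch. It is not the Filesystem V3
implementation: widen() is a hypothetical helper, and it uses the standard
facet from the classic locale rather than the custom Windows facet described
above. It also assumes the conversion never produces more wide characters
than there are input bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> codecvt_type;

// Hypothetical helper (not part of Boost.Filesystem): widen a narrow string
// using the codecvt facet installed in 'loc'.
std::wstring widen(const std::string& narrow,
                   const std::locale& loc = std::locale::classic())
{
    if (narrow.empty())
        return std::wstring();

    const codecvt_type& cvt = std::use_facet<codecvt_type>(loc);

    // Assumption: output length in wchar_t never exceeds input length in bytes.
    std::wstring wide(narrow.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from = narrow.data();
    const char* from_end = from + narrow.size();
    const char* from_next = from;
    wchar_t* to = &wide[0];
    wchar_t* to_next = to;

    std::codecvt_base::result r =
        cvt.in(state, from, from_end, from_next, to, to + wide.size(), to_next);

    if (r == std::codecvt_base::noconv) {
        // The facet reports that no conversion is needed: copy byte by byte.
        for (std::size_t i = 0; i < narrow.size(); ++i)
            wide[i] = static_cast<wchar_t>(static_cast<unsigned char>(narrow[i]));
        return wide;
    }
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("narrow-to-wide conversion failed");

    // Trim the buffer to the number of wide characters actually written.
    assert(to_next >= to);
    wide.resize(static_cast<std::size_t>(to_next - to));
    return wide;
}
```

Passing a locale that carries a different codecvt facet changes the encoding
assumed for the narrow input, which is the role the second path constructor's
cvt argument plays.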

> The second one allows me to use an arbitrary encoding, which is good in
> that I could pass it an utf8_codecvt or ansi_codecvt, if I find some buggy
> versions on the Web or write them myself. But, since it considers all
> encodings equally valid, it makes it hard for people to do the right thing.
>

What I'm suggesting is that people who want to use Unicode use wchar_t
strings now, and char16_t or char32_t strings in C++0x.

For general string use, rather than just paths, I'd like Boost to supply
non-templated Unicode string classes:

* u8_string, u16_string, and u32_string, with guaranteed internal
representations.

* utf_string with an internal representation that is one of the above, but
chosen at run-time.

All would, like boost::filesystem::path, supply member function templates
that take any of the above, as well as standard library and user-defined
types.
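
To make the idea of a guaranteed internal representation concrete, here is a
rough sketch of what a u8_string might look like. Nothing here is an existing
Boost class: the name, the constructors, and the bytes() accessor are all
hypothetical, and a real design would offer far more (iteration, validation
of narrow input, member templates for other source types). It requires
C++0x for char16_t and char32_t.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical sketch, not a Boost class: text is always stored as UTF-8.
class u8_string
{
public:
    u8_string() {}

    // Construct from UTF-32 code points.
    explicit u8_string(const std::u32string& s)
    {
        for (std::size_t i = 0; i < s.size(); ++i)
            append(s[i]);
    }

    // Construct from UTF-16, combining surrogate pairs into code points.
    explicit u8_string(const std::u16string& s)
    {
        for (std::size_t i = 0; i < s.size(); ++i) {
            char32_t c = s[i];
            if (c >= 0xD800 && c <= 0xDBFF) {  // high surrogate
                if (i + 1 >= s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                    throw std::invalid_argument("unpaired high surrogate");
                c = 0x10000 + ((c - 0xD800) << 10) + (s[i + 1] - 0xDC00);
                ++i;
            }
            append(c);  // a lone low surrogate is rejected inside append()
        }
    }

    // The guaranteed internal representation: a sequence of UTF-8 bytes.
    const std::string& bytes() const { return bytes_; }

private:
    // Encode one Unicode scalar value as 1 to 4 UTF-8 bytes.
    void append(char32_t c)
    {
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (c < 0x80) {
            bytes_ += static_cast<char>(c);
        } else if (c < 0x800) {
            bytes_ += static_cast<char>(0xC0 | (c >> 6));
            bytes_ += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            bytes_ += static_cast<char>(0xE0 | (c >> 12));
            bytes_ += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            bytes_ += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            bytes_ += static_cast<char>(0xF0 | (c >> 18));
            bytes_ += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            bytes_ += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            bytes_ += static_cast<char>(0x80 | (c & 0x3F));
        }
    }

    std::string bytes_;
};
```

A run-time utf_string could then hold one of the three fixed representations
plus a tag saying which, converting on demand through constructors like these.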

--Beman

