Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Dave Abrahams (dave_at_[hidden])
Date: 2011-01-19 13:43:55

At Wed, 19 Jan 2011 19:09:48 +0200,
Peter Dimov wrote:
> Dave Abrahams wrote:
> > *Scenario D:* We try for scenario A, and people still use QStrings,
> > wxStrings, etc.
> >
> > *Scenario E:* We add another string class and everyone adopts it.
>
> The problem with using a Unicode string, be it QString or
> utf8_string, to represent paths is that it forces you to pick an
> encoding under POSIX. When the OS gives you a file name as char*, to
> store it in your Unicode string, you have to interpret it. Then, to
> give it back to the OS, you have to de-interpret it.

Nonono; if you don't want to choose an encoding, you store it as a
raw_string (a.k.a. std::string, for example)!

The whole point is to separate by type the things we know how to
interpret from the things we don't.
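
A minimal sketch of that separation, assuming C++17 and purely hypothetical names: raw_string is just an alias for std::string, utf8_string is an illustrative class whose only entry point validates, and display_name is a made-up helper. None of these is an existing Boost or standard type.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

// Hypothetical: bytes exactly as the OS handed them over; no encoding claimed.
using raw_string = std::string;

// Returns true if s is well-formed UTF-8 (rejects truncated sequences,
// overlong forms, surrogates, and code points above U+10FFFF).
inline bool is_valid_utf8(const std::string& s)
{
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0, n = s.size();
    while (i < n) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                       // stray continuation or invalid lead byte
        if (i + len > n) return false;           // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char t = static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (t & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        i += len;
    }
    return true;
}

// Hypothetical: a string we know how to interpret. It can only be built
// from raw bytes through from_raw, which may refuse.
class utf8_string {
public:
    static std::optional<utf8_string> from_raw(const raw_string& bytes)
    {
        if (!is_valid_utf8(bytes)) return std::nullopt;
        return utf8_string(bytes);
    }
    const std::string& bytes() const { return data_; }
private:
    explicit utf8_string(std::string s) : data_(std::move(s)) {}
    std::string data_;
};

// Hypothetical helper: display code wants text and has to cope with
// names that are not text; the raw bytes themselves are never altered.
inline std::string display_name(const raw_string& path)
{
    if (auto text = utf8_string::from_raw(path)) return text->bytes();
    return "<undisplayable name>";
}
```

With types like these, a file name from the OS stays a raw_string all the way to the call that opens it, and only code that needs to interpret the name asks for a utf8_string and handles the failure case.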

Please tell me if I'm missing something that's still important below
after my explanation above. I only skimmed because it mostly seemed
to be based on a misinterpretation of my proposal.

> This forces you to choose between two evils: you can opt to use a
> single byte encoding such as ISO-8859-1, which gives you perfect
> round-trip, but leads to the problem that people can enter a
> Cyrillic file name in your Unicode-enabled GUI and see something odd
> happen on disk, even when their shell is configured as UTF-8 and can
> show Cyrillic names. Or, you can choose to use UTF-8, in which case
> the OS can give you a name which you can't decode properly, because
> it's invalid UTF-8.
>
> There is no single good answer to this, of course; even if you go with
> my recommended approach of treating paths as byte sequences unless and
> until you need to display them (in which case you treat them as
> UTF-8), there'll still be paths that won't show up properly on the
> screen. But the program will be able to work with them, even if they
> are undisplayable.
>
> To give a simple example:
>
>     int my_main( int ac, char const* av[] )
>     {
>         my_fopen( av[1] ); // av[1] holds whatever bytes the OS passed in
>         return 0;
>     }
>
> Since files can have arbitrary byte sequences as names under POSIX
> (Mac OS X excluded), if my_fopen insists on taking valid UTF-8, it
> will refuse to open the file.

Dave Abrahams
BoostPro Computing