Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Peter Dimov (pdimov_at_[hidden])
Date: 2011-01-19 12:09:48


Dave Abrahams wrote:

> *Scenario D:* We try for scenario A. and people still use Qstrings,
> wxStrings, etc.
>
> *Scenario E:* We add another string class and everyone adopts it

The problem with using a Unicode string, be it QString or utf8_string, to
represent paths is that it forces you to pick an encoding under POSIX. When
the OS gives you a file name as char*, you have to interpret it in order to
store it in your Unicode string. Then, to give it back to the OS, you have to
de-interpret it. This forces you to choose between two evils. You can opt to
use a single-byte encoding such as ISO-8859-1, which gives you a perfect
round trip, but then people can enter a Cyrillic file name in your
Unicode-enabled GUI and see something odd happen on disk, even when their
shell is configured as UTF-8 and can show Cyrillic names. Or you can choose
UTF-8, in which case the OS can give you a name that you can't decode
properly, because it's invalid UTF-8.
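
A minimal sketch of the two choices, to make the trade-off concrete. The
helper names (is_valid_utf8, latin1_to_utf8, utf8_to_latin1) are hypothetical
and not part of any library discussed here; std::string is assumed to hold
the raw bytes handed over by the OS.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8. Rejects
// stray continuation bytes, truncated sequences, overlong forms, surrogates
// (U+D800..U+DFFF) and code points above U+10FFFF.
bool is_valid_utf8( std::string const& s )
{
    std::size_t i = 0, n = s.size();
    while( i < n )
    {
        unsigned char c = static_cast<unsigned char>( s[i] );
        std::size_t len;
        unsigned long cp, min;

        if( c < 0x80 ) { ++i; continue; } // ASCII
        else if( (c & 0xE0) == 0xC0 ) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if( (c & 0xF0) == 0xE0 ) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if( (c & 0xF8) == 0xF0 ) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false; // stray continuation byte or invalid lead byte

        if( n - i < len ) return false; // truncated sequence

        for( std::size_t k = 1; k < len; ++k )
        {
            unsigned char t = static_cast<unsigned char>( s[i + k] );
            if( (t & 0xC0) != 0x80 ) return false; // not a continuation byte
            cp = (cp << 6) | (t & 0x3F);
        }
        if( cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) )
            return false;
        i += len;
    }
    return true;
}

// Choice 1: interpret every byte as an ISO-8859-1 character. Any byte
// sequence is accepted, so nothing the OS returns can fail to decode.
std::string latin1_to_utf8( std::string const& bytes )
{
    std::string out;
    for( std::size_t i = 0; i < bytes.size(); ++i )
    {
        unsigned char c = static_cast<unsigned char>( bytes[i] );
        if( c < 0x80 ) { out += bytes[i]; continue; }
        out += static_cast<char>( 0xC0 | (c >> 6) );
        out += static_cast<char>( 0x80 | (c & 0x3F) );
    }
    return out;
}

// Inverse of latin1_to_utf8. Assumes the input was produced by that
// function, that is, it only contains code points up to U+00FF.
std::string utf8_to_latin1( std::string const& u )
{
    std::string out;
    for( std::size_t i = 0; i < u.size(); ++i )
    {
        unsigned char c = static_cast<unsigned char>( u[i] );
        if( c < 0x80 || i + 1 == u.size() ) { out += u[i]; continue; }
        unsigned char t = static_cast<unsigned char>( u[++i] );
        out += static_cast<char>( ((c & 0x03) << 6) | (t & 0x3F) );
    }
    return out;
}
```

With a name such as the three bytes E9 74 E9, is_valid_utf8 returns false, so
a UTF-8-only string class has nothing sensible to store. The ISO-8859-1
interpretation round-trips the same bytes exactly, at the price of the
display problem described above.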

There is no single good answer to this, of course. Even if you go with my
recommended approach of treating paths as byte sequences unless and until
you need to display them (in which case you treat them as UTF-8), there'll
still be paths that won't show up properly on the screen. But the program
will be able to work with them, even if they are undisplayable.
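
One way the bytes-until-display approach could look, as a rough sketch. The
names utf8_sequence_length and path_for_display are hypothetical, and
substituting U+FFFD for undecodable bytes is just one possible display
policy, assumed here for illustration. The stored path is never modified;
only the copy shown to the user is.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length of the well-formed UTF-8 sequence starting at
// s[i], or 0 if the bytes at that position are not well-formed UTF-8.
std::size_t utf8_sequence_length( std::string const& s, std::size_t i )
{
    unsigned char c = static_cast<unsigned char>( s[i] );
    std::size_t len;
    unsigned long cp, min;

    if( c < 0x80 ) return 1; // ASCII
    else if( (c & 0xE0) == 0xC0 ) { len = 2; cp = c & 0x1F; min = 0x80; }
    else if( (c & 0xF0) == 0xE0 ) { len = 3; cp = c & 0x0F; min = 0x800; }
    else if( (c & 0xF8) == 0xF0 ) { len = 4; cp = c & 0x07; min = 0x10000; }
    else return 0; // stray continuation byte or invalid lead byte

    if( s.size() - i < len ) return 0; // truncated sequence

    for( std::size_t k = 1; k < len; ++k )
    {
        unsigned char t = static_cast<unsigned char>( s[i + k] );
        if( (t & 0xC0) != 0x80 ) return 0; // not a continuation byte
        cp = (cp << 6) | (t & 0x3F);
    }
    // Reject overlong forms, out-of-range values and surrogates.
    if( cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) )
        return 0;
    return len;
}

// Display-only conversion: well-formed sequences are copied as they are,
// every other byte is shown as U+FFFD REPLACEMENT CHARACTER.
std::string path_for_display( std::string const& path_bytes )
{
    std::string out;
    std::size_t i = 0;
    while( i < path_bytes.size() )
    {
        std::size_t len = utf8_sequence_length( path_bytes, i );
        if( len == 0 )
        {
            out += "\xEF\xBF\xBD"; // U+FFFD encoded as UTF-8
            ++i;
        }
        else
        {
            out.append( path_bytes, i, len );
            i += len;
        }
    }
    return out;
}
```

The program keeps passing the original bytes to the OS, so a file whose name
shows up with replacement characters can still be opened, renamed or deleted.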

To give a simple example:

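int my_main( int ac, char const* av[] )
{
    if( ac < 2 ) return 1; // no file name given
    // av[1] holds the bytes exactly as the OS passed them in
    my_fopen( av[1] );
    return 0;
}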

Since file names under POSIX can be arbitrary byte sequences, barring '/'
and NUL (Mac OS X excluded), if my_fopen insists on taking valid UTF-8, it
will refuse to open any file whose name is not valid UTF-8.

