Subject: Re: [boost] boost filesystem path as utf-8?
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2012-01-24 04:44:20


On Mon, Jan 23, 2012 at 21:52, Beman Dawes <bdawes_at_[hidden]> wrote:
> On Mon, Jan 23, 2012 at 9:28 AM, Yakov Galka <ybungalobill_at_[hidden]> wrote:
>
>> On Mon, Jan 23, 2012 at 14:47, Beman Dawes <bdawes_at_[hidden]> wrote:
>>
>> > On Mon, Jan 23, 2012 at 4:46 AM, Yakov Galka <ybungalobill_at_[hidden]>
>> > wrote:
>> > [...]
>> >
>> > > Unfortunately it boils to the interface whence you can
>> > > get a c_str() to a UTF-16 string only.
>> >
>> > That's not correct.
>> >
>>
>> It's correct. I state that path::c_str() returns UTF-16 on Windows. It's a
>> fact. So the encoding isn't an implementation detail but a part of the
>> interface.
>>
>
> As quoted above, you said only that "...the interface whence you can get a
> c_str() to a UTF-16 string only."

Don't be picky about words. Yes, this sentence might be ambiguous. But I
say that the correct resolution, using C++ name lookup rules, is "you
can get a path::c_str() to a UTF-16 string only".

> The interface includes multiple observers, which return values with various
> encodings other than UTF-16. The return types from the observers allow
> c_str() to access those values.

Since you didn't read it, I'll repeat it: path::string().c_str()
points into a *temporary*. path::c_str() does NOT. The two have different
semantics, and your library, starting with version 3, doesn't let the
user choose what string path holds inside. As said above, it's not an
implementation detail since it's observable from the interface.
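
To make the lifetime difference concrete, here is a minimal sketch. It assumes C++17 and uses std::filesystem::path as a stand-in for boost::filesystem::path version 3 (the standard class was derived from it, but the two are not identical: std::filesystem's string() always returns by value, which matches the Windows behaviour discussed here). The helper names narrow_name and native_pointer_is_stable are hypothetical.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <type_traits>
#include <utility>

namespace fs = std::filesystem;

// path::c_str() hands out a pointer in the *native* character type
// (wchar_t on Windows, char on POSIX); the user does not pick it.
static_assert(std::is_same_v<
    decltype(std::declval<const fs::path&>().c_str()),
    const fs::path::value_type*>);

// path::string() returns a std::string by value, i.e. a temporary.
static_assert(std::is_same_v<
    decltype(std::declval<const fs::path&>().string()),
    std::string>);

// Hypothetical helper: the safe way to get narrow characters out.
std::string narrow_name(const fs::path& p) {
    // const char* bad = p.string().c_str();  // dangles after the ';'
    std::string owned = p.string();           // keep the temporary alive
    return owned;
}

// Hypothetical helper: the native pointer refers to the path's own
// storage, so it stays the same while the path is not modified.
bool native_pointer_is_stable(const fs::path& p) {
    const fs::path::value_type* a = p.c_str();
    const fs::path::value_type* b = p.c_str();
    return a == b;
}
```

The commented-out line is the trap: it compiles, but the pointer outlives the string it points into.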

> During the design discussions, two other alternatives were discussed. (1)
> Always hold the path internally in a char string encoded UTF-8. The cost on
> Windows is that a conversion has to be done before every file system
> operation.

Not an issue, because:
1) Last time I measured with CreateFile and a naive implementation
using MultiByteToWideChar, it took less than 3% overhead. Faster
conversion routines exist, and you will have to do the conversions
anyway when you communicate with the external world.
2) Let the user choose between narrow chars and wide chars. Why do you
force me to use the latter? Why must getting the filename from a UTF-8
std::string involve 2 conversions (to and from UTF-16) even if I
don't pass anything to the system?
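
For a sense of what one such conversion involves, here is a portable sketch of a naive UTF-8 to UTF-16 decoder. It is a hypothetical stand-in for the MultiByteToWideChar call mentioned above, not that API and not what Boost does internally; the name utf8_to_utf16 is an assumption made for illustration. It assumes C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper: naive UTF-8 -> UTF-16 conversion.
// It rejects malformed lead/continuation bytes and code points above
// U+10FFFF, but does NOT reject overlong forms or encoded surrogates.
std::u16string utf8_to_utf16(std::string_view in) {
    std::u16string out;
    out.reserve(in.size());  // UTF-16 never needs more units than UTF-8 bytes
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Extracting a filename from a UTF-8 std::string through a wide-string path runs this kind of loop once on the way in and its inverse once on the way out, even though no system call is made.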

> The cost on POSIX is that a double conversion has to be done
> before every file system operation if the encoding is not UTF-8.

1) Most POSIX systems use UTF-8 these days.
2) It's fine if it is the native encoding on POSIX, as long as
the user can override it. On Windows she just can't do this because
boost::filesystem::path uses a wide string.

> (2) Hold
> two strings internally, one in the native type and encoding, the other in
> UTF-8. The cost is trying to keep them in sync, with the conversions that
> implies, for some definition of "in sync".

I 100% agree (2) is not an option.

> If class std::basic_string itself had better support for string
> interoperability, class path would be able to side step at least some of
> the conversion headaches.

Maybe, but almost surely not. It would just shift the burden to another
place: the user.

What you didn't say is that *during the original filesystem review* it had
a templatized basic_path and the user *could choose* between narrow
and wide strings. Add this option to the list above.
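
A rough sketch of what such an option looks like, assuming C++11 or later. This is NOT the reviewed Boost code: basic_path_sketch, narrow_path and wide_path are hypothetical names, and only the string type is parameterized here.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical sketch of a path templated on its string type.
template <class String>
class basic_path_sketch {
public:
    using string_type = String;
    using value_type  = typename String::value_type;

    explicit basic_path_sketch(String s) : str_(std::move(s)) {}

    // Non-temporary pointer, in the character type the *user* chose.
    const value_type* c_str() const noexcept { return str_.c_str(); }

    // Last component, taken straight from the stored string:
    // no transcoding happens anywhere on this route.
    String filename() const {
        const auto pos = str_.find_last_of(static_cast<value_type>('/'));
        return pos == String::npos ? str_ : str_.substr(pos + 1);
    }

private:
    String str_;
};

using narrow_path = basic_path_sketch<std::string>;   // e.g. UTF-8
using wide_path   = basic_path_sketch<std::wstring>;  // e.g. UTF-16 on Windows
```

With narrow_path, a UTF-8 string goes in and a UTF-8 filename comes out untouched; a conversion would only be needed at the point where the path is actually handed to the system.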

-- 
Yakov
