Boost logo

Boost :

Subject: [boost] [Filesystem] Proposal: make filesystem generic-programming friendly
From: Alexander Lamaison (awl03_at_[hidden])
Date: 2014-01-09 17:32:38


Dear all,

I'm increasingly finding that I need to write programs that can operate
over multiple kinds of filesystem, including the local filesystem,
network filesystems such as (S)FTP, filesystems in archives such as zip
files, and mock filesystems that exist only in memory for unit testing.

This is not as easy as it could be. The difficulty is that the
Boost.Filesystem API makes it difficult to write generic code that uses
the local filesystem as just one of several API-compatible filesystems.

Therefore, I've written a proposal for making a backward-compatible
addition to the Boost.Filesystem API that would make it much more
friendly to generic programming. It is a first draft and I welcome your
comments, criticism and suggestions for improvement.

The proposal is available here:
http://alamaison.github.io/2014/01/09/proposal-generic-filesystem/ and
I've included a text-only version below.

Alex

Proposal: Generic filesystem API
================================

[FIRST DRAFT]

C++ streams provide a common interface to operate on file-like data,
regardless of how that data it is actually stored. However, programs
often need to operate not just on data from different file sources,
but also on entirely separate filesystems.

The problem
-----------

There is, currently, no way to operate on different filesystems
generically, if Boost.Filesystem is to be one of those filesystems.
The local-filesystem operations in the Boost.Filesystem API make it
difficult to call them generically because they, inadvertently, defeat
ADL, so generic code cannot resolve them to specific
implementations.

For example, how do you write code that calls `path
temp_directory_path();` and can operate on both the local filesystem,
via Boost.Filesystem, and on a filesystem over FTP, via an FTP library?
If both libraries declare the function, how do you resolve the correct
implementation? You can't use a namespace as a type.

Normally, ADL is the solution to the problem; it resolves the correct
namespace based on the namespace of the operation's argument(s). But
`temp_directory_path()` doesn't take an argument.

The source of the problem is a misconception that the free-functions
performing FS operations are part of the `class path` API, when really
they are part of the API of an implicit local-filesystem object. The
Boost.Filesystem FAQ asks:

> **Why are paths sometimes manipulated by member functions and
> sometimes by non-member functions?**
>
> The design rule is that purely lexical operations are supplied as
> class path member functions, while operations performed by the
> operating system are provided as free functions.

This is wrong because the majority of the non-member operations don't
manipulate paths at all. They are functions that manipulate the
filesystem, some of which use a path, and, therefore, are really part of
the API of an implicit local filesystem object. The proposed solution
(later) just makes this explicit, in a backward-compatible way.

Is modifying Boost.Filesystem necessary?
----------------------------------------

Before we discuss our proposal, let us explore how much can be
achieved without any changes to Boost.Filesystem, using the typical
ADL approach to generic programming.

### Limited solution: ADL on filesystem-specific path

If each filesystem were to declare a `path` class conforming to the
`boost::filesystem::path` interface, ADL could resolve a *subset* of the
filesystem operations.

The example below implements a simple generic algorithm,
`remove_if_temporary` over both Boost.Filesystem and an imaginary
`ftp_filesystem`.

#include <iostream>
#include <boost/filesystem.hpp>

namespace ftp_filesystem {

    class path
    {
    public:

        path(const std::string& server, const std::string& location)
            : server_(server), location_(location) {}

        std::string string() const
        { return location_; }

        /* ... */
    private:
        std::string server_;
        std::string location_;
    };

    void remove(const path& target)
    {
        std::cout << "Removing " << target.string()
                  << " over FTP" << std::endl;
    };

}

template<typename Path>
void remove_if_temporary(const Path& file_path)
{
    if (file_path.string()[file_path.string().size() - 1] == '~')
    {
        remove(file_path);
    }
}

int main()
{
    remove_if_temporary(boost::filesystem::path("/tmp/bob"));
    remove_if_temporary(boost::filesystem::path("/home/bob~"));

    remove_if_temporary(
        ftp_filesystem::path("ftp.example.com", "/tmp/bob"));
    remove_if_temporary(
        ftp_filesystem::path("ftp.example.com", "/home/bob~"));
}

The key to this solution is that each filesystem's namespace maintains
a separate `path` type. The algorithm is instantiated with different
path types, allowing ADL to resolve the necessary implementations.

However, this is not a good solution because it is continues the
pretence that operations are part of `class path`, but this pretence
only stretches so far. The result is that:

1. Operations that don't take a `path` argument can't be used in generic
   algorithms. Some operations, such as `temp_directory_path()` just
   *return* a `path`. ADL won't resolve those.
2. `path` becomes coupled to a particular filesystem.
   `boost::filesystem::path` is a currently a good abstraction of many
   filesystem's path representation, but making it as the ADL key that
   resolves operations to their implementation means that other
   filesystems can't use it as their path representation (would
   `BOOST_STRONG_TYPEDEF` solve this?).
3. `path` must carry filesystem instance data. The local
   filesystem is a singleton so its path only needs the
   location-on-disk string. But other types of filesystem may be
   instantiated with host names, authentication information, running
   network sessions, etc. The path instances have to carry this
   extra baggage.

In addition, important parts of the API, namely file opening/creation
and directory iteration, can still not be used generically, without
changes to Boost.Filesystem, because they are *classes* constructed
with a `path`, rather than free functions taking a `path` argument.
Solving this will need additional factory functions for each class so
that ADL can resolve them.

**In summary**: generic filesystem programming is possible with no
changes to Boost.Filesystem if generic algorithms don't open files,
create files, iterate directories or use any nullary filesystem
operations. Even accepting these limitations, the API is not ideal
because of undesirable coupling between `path` and the specifics of
a filesystem.

### Proposed solution: Add local filesystem object to Boost.Filesystem

The main problem with the above solution is that nullary operations
cannot be resolved by ADL. In effect, these operations take a hidden
local filesystem argument, but, as the argument is implicit, ADL
cannot use it to resolve them

The suggestion in this proposal is to make the filesystem object
explicit so that:

- ADL can resolve all the operations using the namespace of this
  object; or
- the operations can be resolved as methods of this object.

Either change is backward-compatible. The existing functions remain in
place, and code that only uses the local filesystem need never refer
to the local filesystem object explicitly.

The example below demonstrates the second option. This is my
preferred option because all filesystems, except the local one, are
likely to implement their operations in terms of private data in the
filesystem object. Passing this object to free functions, instead of
calling a method, means this private data must be exposed.

The example below demonstrates a partial implementation of the
additional new `class filesystem`, and shows how to use it generically.

#include <boost/filesystem.hpp>

namespace ftp_filesystem {

    class filesystem
    {
    public:

        typedef boost::filesystem::path path;

        filesystem(const std::string& server)
            : server_(server) {}

        path temp_directory_path() const
        {
            std::cout << "Querying " << server_
                      << " for $TMPDIR\n";
            return "/tmp";
        }

        void remove(const path& target)
        {
            std::cout << "Removing " << target.string()
                      << " on " << server_
                      << " over FTP" << std::endl;
        };

    private:
        std::string server_;
    };

}

/* Added to existing Boost.Filesystem API */
namespace boost { namespace filesystem {

    class filesystem
    {
    public:

        typedef boost::filesystem::path path;

        path temp_directory_path() const
        {
            return temp_directory_path();
        }

        void remove(const path& target)
        {
            remove(target);
        };

    private:
        filesystem() {}

        friend boost::filesystem::filesystem& local_filesystem();
    };

    /* Local filesystem is singleton */
    inline boost::filesystem::filesystem& local_filesystem()
    {
        static filesystem instance;
        return instance;
    }
}}

template<typename Filesystem>
void remove_if_temporary(
    Filesystem& fs, const typename Filesystem::path& file_path)
{
    std::string name = file_path.filename().string();

    if (name[name.size() - 1] == '~' ||
        file_path.parent_path() == fs.temp_directory_path())
    {
        fs.remove(file_path);
    }
}

int main()
{
    ftp_filesystem::filesystem ftp_fs("ftp.example.com");

    boost::filesystem::filesystem& local_fs =
        boost::filesystem::local_filesystem();

    boost::filesystem::path in_temp_dir("/tmp/bob");
    boost::filesystem::path temp_by_name("/home/bob~");

    remove_if_temporary(ftp_fs, in_temp_dir);
    remove_if_temporary(ftp_fs, temp_by_name);

    remove_if_temporary(local_fs, in_temp_dir);
    remove_if_temporary(local_fs, temp_by_name);
}

As well as the technical reasons for introducing this change, I
believe it better reflects the division of responsibilities between
`class path` and the filesystem. Paths are not coupled to the
filesystem and can be shared between supporting filesystems. If a
particular filesystem needs a different path implementation, that is
supported too and it is up to the generic algorithm whether it works
with all path implementations or just `boost::filesystem::path`
instances. Filesystem details are encapsulated in the filesystem
object alone and it performs all the operations using those internal
details.

    The example above implements <code>class filesystem</code> in
    terms of the existing operations functions so that it will compile
    with no further changes. A better idea may be to implement the
    existing operations in terms of the new filesystem object's
    methods because that better reflects their meaning.

Impact on Boost.Filesystem
--------------------------

The impact on existing users of Boost.Filesystem should be minimal.
The only change they may notice is extra documentation for generic use
of the API.

### Backward compatibility

No existing code using Boost.Filesystem will need any changes. The
existing free functions remain in place as a sensible convenience for
the common case where only the local filesystem is needed.

### Additions

A new object `boost::filesystem::filesystem` is introduced, along with
its singleton accessor function, `local_filesystem`.

Depending which of the two API options is chosen:

- the operation free-functions are overloaded to take an instance of
  `boost::filesystem::filesystem`; or
- `class boost::filesystem::filesystem` is given methods that perform
  each of the operations currently in free functions.

Even functions `absolute()` and `canonical()`, which may seem to be
part of the path API, are really part of the filesystem object's API:
they can be implemented solely in terms on `class path`'s public API
but not solely in terms of `class filesystem`'s API.

Finally, the classes in the API that perform operations on the
filesystem, e.g. `fstream`, `directory_iterator` require one of the
following changes:

- Add factory functions for each class to `namespace
  boost::filesystem`. Allows them to be resolved by ADL. Generic
  callers use auto as return type.
- Add static factory functions for each class to the `path`
  class. Allows them to be resolved by ADL. Generic callers use auto
  as return type.
- Add typedefs for each class to `path` class. Callers construct
  classes directly.

Factory functions for streams will require a workaround in C++03
because the standard streams are not movable.

TODO

----
- Define filesystem-path concept
- Define filesystem concept
--
Swish - Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk