Boost logo

Boost :

Subject: Re: [boost] [filesystem] Request for comments on proposed relative() function
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2014-05-15 08:44:53


On 2014-05-15 02:02, Gavin Lambert wrote:
> ...
>>> Although that brings up another question (which I'm not really familiar
>>> enough with the "path" class internals to answer by looking at the
>>> example implementation): is "base" intended to be assumed as a
>>> directory name (which is how most filesystem "make relative"
>>> functions typically work) or as a file name (which is how URL "make
>>> relative" works)?
>>
>> path makes no assumptions about what it references, or whether the
>> pathname even exists in the filesystem.
>
> I'm aware of that; my point was that the semantics of the "make a
> relative path" operation typically differ between filesystems and URLs.

Dear all,

One problem with boost filesystem is the lack of theoretical foundation.
Hey, don't run away. :)

Note: I have not checked the formalities thoroughly, so parts of the
text might be erroneous or incomplete.
Note: I type slashes everywhere because it is easier. Some should be
backslashes on Windows.

What are paths?
===============

Surprisingly this is already a controversial question. From my
experience, I can classify people's attitude to three categories.

"Paths are names of resources, cookies for the user code."
----------------------------------------------------------

This is a justified approach, in simple cases like:

     int main(int argc, char *argv[])
     {
         ifstream in(argv[1]);
         ofstream out(argv[2]);
         out << in.rdbuf();
     }

But obviously there is more to paths than that!

"Paths are strings. Doing simple string operations won't do anybody harm."
--------------------------------------------------------------------------

This is a common approach in scripting languages, for example. There are
plenty of C++ code that does the same thing. This approach assumes
specific connection between syntax and semantics of paths, however.

The most common path operation, namely concatenation, can be easily
defined to be consistent by applying some discipline (at least on Unix
and Windows). Specifically, by ending all directories with a slash,
concatenating any non-file path x with a relative path y can be done
simply by x + y (std::string::operator +).

The problem of this approach is that it does not encompass the meaning
behind the defined operations, making hard to well-define them.

"Paths are sequences of instructions to locate some resources."
---------------------------------------------------------------

After all, this is what they are for the OS. It breaks them to a
sequence of 'path elements' (from here on simply 'elements'), and
depending on each of them, does some state transition on the internal
path resolution state, getting to the desired filesystem node.

This definition is not limited to filesystem paths, it is general enough
to work with "step left, 3 steps back, dig the treasure", or "England,
London, Baker St. 4". We will get to it later.

Definitions
===========

Notation:
* f,g,h filesystem nodes
* a,b,c elements
* x,y,z paths

 From here on, if we quantify over "all filesystem nodes f", the
intention is over all conceivable nodes in all possible machine states,
not only in the concrete filesytem the program is currently run.

Definition: a path is a finite sequence of elements, denoted by x = a0 /
... / an. (Here '/' is only an algebraic notation.)

With filesystem paths, the elements are instructions of how to traverse
the graph of filesystem nodes. Kinds of elements include:

* On Unix: root "/", regular "name", current ".", parent "..".
* On Windows: drive "c:", root "/", regular "name", current "." and
parent "..".

The action of an element on a node, denoted f / a, returns a new
filesystem node as if by a call to openat() on Unix. On Windows there is
no direct analogy of openat, so the semantics are somewhat similar to
both, NtCreateFile() and SetCurrentDirectory(). The action of a path on
a node is defined by f / (a0 / ... / an) = (...(f / a0) / ... ) / an.

(openat() is consistent with this definition, in the sense that you get
the same result whether you pass any string representation (see below)
of (a0 / ... / an) in one call, or invoke it repeatedly for each element.)

Naturally, concatenation of paths x / y is simply concatenation of the
said sequences. The identity f / (x / y) = (f / x) / y holds, by definition.

So far I gave an abstract definition of paths *without saying how they
are represented syntactically*. This point is important because we
cannot map every sequence of elements to a string in the OS specific
syntax. For example, there is no encoding of the sequence "a" / "c:" on
Windows. However:

Definition: x is equivalent to y iff for all f, if g = f / x and h = f /
y are both defined, then g = h (i.e. they yield the same node).

Definition: str(x) is the canonical string representation of the path x.
path(s) is the path resulting from parsing the string s.

I leave str() and path() be implementation defined, subject to the
following requirements:

* path(str(x)) is equivalent to x
* if x and y are equivalent, then str(x) = str(y)

In practice there need no be path() and str() functions. In memory,
paths are already represented as strings. All operations, although
formally defined on abstract paths, can be implemented to work directly
on strings. Therefore, where it is obvious from the context, we can
identify paths x with their string representations str(x). For all
operations, except canonical(s) = str(path(x)), we require that if the
operands are already in canonical string representation, then the result
is also to be in canonical string representation. This lets the
implementation to short-circuit when appropriate.

Operations (examples are for Windows, because it is less trivial than Unix):
* concatenation -- already defined:
     * "a" / "b" = "a/b"
     * "c:" / "d" = "c:d"
     * "/a/b" / "c:" = "c:"
     * "a" / "../b" = "a/../b"
     * "d:/a" / "/b" = "d:/b"
* is_absolute(x) iff there exists f such that for all g, g / x = f.
     * is_absolute("d:/xyz") = true
     * is_absolute("/abc") = false
     * is_absolute("c:") = true or false, depending on whether we
include the environment states into the filesystem node quantification.
* relative(x, y) = z s.t. y / z is equivalent to x. (not always exists)
     * relative("a/b", "a") = "b"
     * relative("c:/a/b", "c:") = "/a/b"
     * relative("c:/a/b", "c:/") = "c:/"
     * relative("c:a/b", "c:/") = NaP
     * relative(x, "") = x

Eval-equivalence
----------------

Definition: x is eval-equivalent to y iff for all f, if g = f / x and h
= f / y are both defined, and none involve symbolic link resolution,
then g = h.

The rest of the previous section is repeated except that
eval-equivalence is used.

* concatenation
     * "a/b" / "./../c" = "a/c"
     * "a" / "../../b" = "../b"
* relative(x, y)
     * relative("a/b", "a/c") = "../b"
     * relative("c:a/b", "c:/") = NaP

Note that on Windows, AFAIK but I may be mistaken, eval-equivalence and
equivalence are the same, because ".." are resolved syntactically by
Win32 API before any reparse points etc. kick-in. One can think of
eval-parent ".." meaning 'go back' whereas the Unix parent ".." behave
as a regular subdirectory. One may still wish to distinguish between
equivalence and eval-equivalence on Windows.

Summary
-----------

Lots of questions are left intentionally un-answered. The above text
establishes a framework, from here we can proceeded in different ways.
For every concrete paths implementation we define
* possible path elements
* the effect of these elements on the filesystem nodes
* path equivalence
* a parsing function path()
* a formatting function str()

Possible variation:
* Define that x is equivalent to y as before, with an added stipulation
that the current "." elements are assumed to be regular. So that "." /
"." = "./." rather than ".".
* Split regular elements to filenames and directories. When parsing
"abc/", path() will return a single directory element, when parsing
"abc" it will return a single filename element. This is similar to the
way URLs work. Usecase: "~/a.txt" / "b.txt" = "~/b.txt".

Where Boost.Filesystem fails?
=============================

Note I wasn't following its evolution in the last year or two, so some
of this section might be outdated.

Path is a sequence of what?
---------------------------

Constructors, value_type, append(), all suggest that path is a sequences
of chars/wchars. begin()/end()/iterator, on the other hand, suggest that
path is a sequence of... paths? O.o

Let's leave alone the fact that using the second as a definition doesn't
work (it's like saying that string is a sequence of strings, though some
languages do exactly this). The bigger problem is that
path::iterator::value_type != path::value_type. What I naturally want to
do when slicing paths, is to construct path(first, last) where first and
last are path::iterators, but I cannot. Currently, at places where this
is the intended operation, the documentations says "as if applying
operator /= repeatedly". Which brings us to the following point.

operator /= is broken
---------------------

The path "c:a/b" decomposes, on Windows, to the sequence {"c:", "a",
"b"}. But applying operator /= to this sequence repeatedly returns "c:/a/b".

Now, according to the documentation, parent_path() of "c:a/b" should
return "c:/a". But guess what, the last time I checked, it returned the
desired "c:a"!

Someone got confused...

absolute(p,base) illogical
--------------------------

The case when p.has_root_name() && !p.has_root_directory() makes no
sense. I remember that it was once left unspecified. Don't know why it
got defined now.

Explanation:
The analogy of what today's implementation does is absolute("England,
Baker St. 4", "France, Paris, Passage Landrieu 8") = "England, Paris,
Baker St. 4". For those who don't like analogies, absolute("c:a/b")
shall return "c:/...current c drive directory.../a/b", not "c:/a/b" as
it does today! More generally, when changing a higher-rank element, it
makes no sense to leave the lower-rank elements which was specified
relative to that higher-rank element!

Note that:
* For other cases, absolute() does the same thing as abstract path
concatenation defined above.
* This case is irrelevant for Unix.

Thus I propose to scrap absolute() altogether.

Bad names
---------

* parent_path() does not return the parent directory! It should be named
pop_back().
* absolute() isn't needed, system_complete() should be renamed to
absolute().
* canonical() -- a better name would be resolve(), cause what it
ultimately does is resolving symlinks? Naming it realpath() is also an
option.

equivalent() could be more useful...
------------------------------------

...if it exposed an encapsulated system specific unique file "key". Then
we could make an unordered_map of files by their unique filesystem keys.

directory iterators
-------------------

I would expect a directory_iterator to return elements rather than
concatenating them with the path passed in the constructor. Similar to
what 'ls' does. recursive_directory_iterators, on the other hand, can be
modeled after a 'find .' and do what it does today. Think of
implementing recursive_directory_iterators with the current public
interface of directory_iterator...

Path arithmetic and symlinks
============================

Path arithmetic, which includes concatenation, relative, is_absolute,
etc., shall work purely syntactically. Generally, operations that work
on paths shall either require that the whole path exists (or they are
creating it), or ignore the filesystem completely. All or nothing. It is
user's responsibility to resolve() the path if he really wants to.

Rationale: by the definitions section, paths are abstract entities
detached from the filesystem. Asking for relative(x,y) or x/y are
legitimate questions on their own, even if the paths do not exist at the
specific point in time and space. Even when x and y are relative to a
yet-unknown base. Making them depend on the current filesystem state is
error prone.

Bottom line
===========

Boost.Filesystem, as it is today, does not have a clear separation of
syntax, semantics and filesystem access. Overall it looks quite messy,
and when I tried to use it, it did not solve the real problems I had at
hand.

RFC and thanks for your attention,
Yakov Galka


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk