Subject: Re: [boost] [filesystem] proposal: treat reparse files as regular files
From: Paul Harris (harris.pc_at_[hidden])
Date: 2015-07-29 00:27:45


On 29 July 2015 at 10:06, Niall Douglas <s_sourceforge_at_[hidden]> wrote:

> On 28 Jul 2015 at 20:40, Paul Harris wrote:
>
> > I _disagree_ with the way dedup'd files are currently treated as a
> > special file (as if they were a device or a character file or a fifo or a
> > socket). Devices, sockets and fifos all need to be read in a special way,
> > but dedup'd files should be read as if they were plain files.
> >
> > I _disagree_ that a dedup file should be treated as if it were a symlink.
> > This is because a dedup file does not point to another file (or inode) on
> > the file system, which is a characteristic of a symlink or a hardlink. It
> > is basically just a compressed file. We don't treat NTFS-compressed files
> > differently from regular files, so why are we treating dedup'd files
> > differently?
>
> NTFS compressed files act exactly like normal files. Reparse point
> files do not and require significant additional processing to figure
> out what kind they are. That's the difference.
>

You only need to process symlink-reparse-point-files.
Dedup reparse point files can be treated the same as a normal file.
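
To make that rule concrete, here is a minimal sketch of the decision once the reparse tag is already known (fetching the tag is the expensive part discussed below, and is not shown). The constants mirror the Windows SDK values for FILE_ATTRIBUTE_DIRECTORY, FILE_ATTRIBUTE_REPARSE_POINT, IO_REPARSE_TAG_SYMLINK and IO_REPARSE_TAG_DEDUP, redefined locally so the sketch compiles anywhere. The names `entry_kind` and `classify` are hypothetical and are not part of Boost.Filesystem or AFIO.

```cpp
#include <cstdint>

// Local copies of Windows SDK values, so this sketch builds on any platform.
constexpr std::uint32_t kFileAttributeDirectory    = 0x00000010; // FILE_ATTRIBUTE_DIRECTORY
constexpr std::uint32_t kFileAttributeReparsePoint = 0x00000400; // FILE_ATTRIBUTE_REPARSE_POINT
constexpr std::uint32_t kReparseTagSymlink         = 0xA000000C; // IO_REPARSE_TAG_SYMLINK
constexpr std::uint32_t kReparseTagDedup           = 0x80000013; // IO_REPARSE_TAG_DEDUP

// Hypothetical result type for this sketch.
enum class entry_kind { regular, directory, symlink, other };

// Hypothetical helper: decide how to report an entry from its attributes and
// reparse tag. Only symlink reparse points get special handling; dedup
// reparse points fall through and are reported like any non-reparse entry.
constexpr entry_kind classify(std::uint32_t attributes, std::uint32_t reparse_tag)
{
    if (attributes & kFileAttributeReparsePoint) {
        if (reparse_tag == kReparseTagSymlink)
            return entry_kind::symlink;
        if (reparse_tag != kReparseTagDedup)
            return entry_kind::other;   // unknown reparse point: stay conservative
        // Dedup: fall through to the ordinary file/directory decision.
    }
    return (attributes & kFileAttributeDirectory) ? entry_kind::directory
                                                  : entry_kind::regular;
}
```

With this sketch, an entry carrying FILE_ATTRIBUTE_REPARSE_POINT and the dedup tag is reported as a regular file, while any reparse tag the sketch does not recognise is still reported as "other".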

>
> From AFIO's perspective, when it calls NtQueryDirectoryFile() to fetch
> metadata about a file entry, it can learn at zero cost whether an entry
> is a reparse point by examining FileAttributes for the
> FILE_ATTRIBUTE_REPARSE_POINT flag. It cannot tell what kind of
> reparse point file it is without opening the file and asking.
>
> Windows' CreateFile() API is astonishingly slow. To require calling
> that, then an additional NtQueryDirectoryFile() to fetch the
> FILE_REPARSE_POINT_INFORMATION metadata and close the handle - which
> is the fastest way I know of to fetch the reparse point tag code -
> would impose an enormous performance penalty for all file entries
> marked with FILE_ATTRIBUTE_REPARSE_POINT.
>
>
I have no comment on performance. I want things to work.

> I appreciate you're saying the cost is worth it, but we're thinking
> all Boost users here, not just the small minority on Windows Server
> 2012 with dedup turned on.
>

You don't seem to understand that this affects ANY Windows client that talks
to a Windows 2012 dedup-enabled server.

As of last month, that has gone from zero to five different companies in
my world. It seems that all the IT departments are upgrading after the end
of the financial year.

So, a Windows 7 user will be accessing dedup files.

> > for (directory_iterator ...)
> > {
> >     if (is_symlink(fn)) backup_link(fn);
> >     if (is_regular_file(fn)) backup_contents(fn);
> >     if (is_directory(fn)) ignore(fn);
> >     if (is_other(fn)) ignore(fn);
> > }
> >
> > Currently, this pseudocode would fail to back up any automatically
> > dedup'd files (which are basically any file older than 3 days on some of
> > my sites). It fails because a dedup'd file is currently an "other".
> >
> > If you treat a dedup'd file as a symlink, only the "link" will be backed
> > up. This link points to a magical place that is impossible to read other
> > than by simply reading "fn".
> >
> > So how does this simple program back up the dedup'd file contents?
>
> I appreciate the problem with saying something is a symlink, but
> trying to retrieve the target of that symlink has to error out
> because it's meaningless in the case of a dedup symlink.
>

Please stop calling it a "dedup symlink". It is _not_ any kind of symlink.
That is the point of misunderstanding; we are not on the same page.

>
> What seems to me the best route forward is you do something like
> this:
>
> if (is_symlink(fn))
> {
>     error_code ec;
>     auto target=read_symlink(fn, ec);
>     if(!ec)
>         backup_link(fn);
> }
>
> Because is_regular_file() and is_directory() use status(), they
> follow any symlink so you can safely fall through to those.
>
>
This is unacceptable, because I do not want to follow symlinks.
That was specified in the example.

Let's be more specific about the example directory to back up.

On Monday, it contains:
FILE_A (a plain file)
FILE_B (a symlink to FILE_A)
FILE_C (a plain copy of FILE_A)

Backup should store this:
FILE_A contents. FILE_B link. FILE_C contents.

On Tuesday, dedup/archival has run on the server. Directory now contains:
FILE_A (a dedup file)
FILE_B (a symlink to FILE_A)
FILE_C (a dedup file)

Backup SHOULD store this:
FILE_A contents. FILE_B link. FILE_C contents.

If you treat dedup files as symlinks, then the example will instead store:
FILE_A link. FILE_B link. FILE_C link.
(although I have no idea what reading "FILE_A link" would actually return)

If you follow symlinks, then backup stores the wrong thing:
FILE_A contents. FILE_B contents (WRONG). FILE_C contents.

If you treat dedup files as regular files, then backup stores correctly:
FILE_A contents. FILE_B link. FILE_C contents.
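
The three outcomes above can be reproduced with a small model. This is only a sketch of the reasoning and makes no filesystem calls: each entry carries a hypothetical `kind`, and a hypothetical `policy` says how the library would report dedup files. None of these names exist in Boost.Filesystem.

```cpp
#include <string>
#include <vector>

// What an entry really is on disk (model only).
enum class kind { plain, symlink, dedup };

// How the library is assumed to report dedup files, and whether links are followed.
enum class policy {
    dedup_as_symlink,           // dedup reported as symlink, links stored as links
    dedup_as_symlink_followed,  // dedup reported as symlink, all links followed
    dedup_as_regular            // dedup reported as a regular file
};

struct entry {
    std::string name;
    kind k;
};

// Returns one line per entry: "<name> link" or "<name> contents".
std::vector<std::string> backup_plan(const std::vector<entry>& dir, policy p)
{
    std::vector<std::string> out;
    for (const entry& e : dir) {
        bool store_link = false;
        switch (p) {
        case policy::dedup_as_symlink:
            store_link = (e.k != kind::plain);    // real and "dedup" links alike
            break;
        case policy::dedup_as_symlink_followed:
            store_link = false;                   // everything resolves to contents
            break;
        case policy::dedup_as_regular:
            store_link = (e.k == kind::symlink);  // only real symlinks
            break;
        }
        out.push_back(e.name + (store_link ? " link" : " contents"));
    }
    return out;
}
```

Feeding Tuesday's directory (FILE_A dedup, FILE_B symlink, FILE_C dedup) through the three policies gives the three results listed above. Only dedup_as_regular matches what Monday's backup stored.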

cheers,
Paul

