Subject: Re: [boost] [filesystem] proposal: treat reparse files as regular files
From: Gavin Lambert (gavinl_at_[hidden])
Date: 2015-07-29 02:09:49


On 29/07/2015 14:06, Niall Douglas wrote:
> NTFS compressed files act exactly like normal files. Reparse point
> files do not and require significant additional processing to figure
> out what kind they are. That's the difference.
>
> From AFIO's perspective, when it does NtQueryDirectoryFile() to fetch
> metadata about a file entry, it can zero cost learn if an entry is a
> reparse point by examining FileAttributes for the
> FILE_ATTRIBUTE_REPARSE_POINT flag. It cannot tell what kind of
> reparse point file it is without opening the file and asking.
>
> Windows' CreateFile() API is astonishingly slow. To require calling
> that, then an additional NtQueryDirectoryFile() to fetch the
> FILE_REPARSE_POINT_INFORMATION metadata and close the handle - which
> is the fastest way I know of to fetch the reparse point tag code -
> would impose an enormous performance penalty for all file entries
> marked with FILE_ATTRIBUTE_REPARSE_POINT.

If it helps,
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365511.aspx
seems to specify that reparse points provide their tag id in the
dwReserved0 field of the WIN32_FIND_DATA structure (I'm not sure how
that maps to the native API, but I assume it's somewhere). That should
be sufficient to identify the reparse point type.

(https://msdn.microsoft.com/en-us/library/windows/desktop/aa365740.aspx
backs this up, incidentally.)
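
As a rough sketch of what reading the tag from the find data might look
like: the helper names reparse_tag_of and list_reparse_tags are made up for
illustration, and the attribute value is copied from winnt.h so that the pure
helper compiles without the Windows headers. Only dwFileAttributes,
dwReserved0 and FILE_ATTRIBUTE_REPARSE_POINT come from the MSDN pages above.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Value of FILE_ATTRIBUTE_REPARSE_POINT from winnt.h, repeated here so the
// helper below does not depend on <windows.h>.
constexpr std::uint32_t kAttrReparsePoint = 0x00000400;

// Hypothetical helper: given the dwFileAttributes and dwReserved0 fields of a
// WIN32_FIND_DATA entry, return the reparse tag, or nothing if the entry is
// not a reparse point (dwReserved0 is only meaningful when the flag is set).
constexpr std::optional<std::uint32_t>
reparse_tag_of(std::uint32_t attributes, std::uint32_t reserved0) {
    if ((attributes & kAttrReparsePoint) == 0) return std::nullopt;
    return reserved0;
}

// Usage: a plain directory yields no tag; a flagged entry yields its tag.
inline void reparse_tag_of_example() {
    assert(!reparse_tag_of(0x10, 0));
    assert(reparse_tag_of(0x410, 0xA0000003u) == 0xA0000003u);
}

#ifdef _WIN32
#include <windows.h>
#include <string>
#include <utility>
#include <vector>

// Enumerate a pattern such as L"C:\\some\\dir\\*" and collect the name and
// tag of every entry that is marked as a reparse point.
std::vector<std::pair<std::wstring, std::uint32_t>>
list_reparse_tags(const std::wstring& pattern) {
    std::vector<std::pair<std::wstring, std::uint32_t>> out;
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileW(pattern.c_str(), &fd);
    if (h == INVALID_HANDLE_VALUE) return out;
    do {
        if (auto tag = reparse_tag_of(fd.dwFileAttributes, fd.dwReserved0))
            out.emplace_back(fd.cFileName, *tag);
    } while (FindNextFileW(h, &fd));
    FindClose(h);
    return out;
}
#endif
```

This goes through the Win32 layer rather than the native API, so it only
shows where the tag ends up, not how cheaply AFIO could obtain it.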

Granted, a single NtQueryDirectoryFile pass over the directory is not
enough to get both the regular metadata and the reparse tags, but you
should still be able to do it in just two passes per directory (each pass
taking however many calls are required to fully enumerate the directory,
of course).

Presumably you're currently using one of FileBothDirectoryInformation or
FileFullDirectoryInformation. You should be able to switch to the "Id"
variants (FileIdBothDirectoryInformation or
FileIdFullDirectoryInformation) instead (if you're not already using
them). This gives you a FileId for each file, along with the other
information.

After you've enumerated the entire directory, you can go back and get
FileReparsePointInformation for the whole directory, and then match up
the FileId against the FileReference to merge the data and get the
reparse tag for each file.
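
The merge step could be sketched roughly as below. The struct and function
names (DirEntry, ReparseRecord, MergedEntry, merge_reparse_tags) are
simplified, hypothetical stand-ins for the native records, not the real
structure layouts, and the sketch assumes the second query only returns
records for reparse points.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one FileIdBothDirectoryInformation record.
struct DirEntry {
    std::uint64_t file_id;
    std::string   name;
    std::uint32_t attributes;
};

// Hypothetical stand-in for one FileReparsePointInformation record.
struct ReparseRecord {
    std::uint64_t file_reference;
    std::uint32_t tag;
};

struct MergedEntry {
    std::string   name;
    std::uint32_t attributes;
    std::uint32_t reparse_tag;  // 0 when no reparse record matched
};

// Join the two enumerations on FileId == FileReference.
std::vector<MergedEntry>
merge_reparse_tags(const std::vector<DirEntry>& entries,
                   const std::vector<ReparseRecord>& records) {
    // Index the (usually much shorter) reparse list by file reference.
    std::unordered_map<std::uint64_t, std::uint32_t> tag_by_id;
    tag_by_id.reserve(records.size());
    for (const auto& r : records) tag_by_id.emplace(r.file_reference, r.tag);

    std::vector<MergedEntry> out;
    out.reserve(entries.size());
    for (const auto& e : entries) {
        auto it = tag_by_id.find(e.file_id);
        out.push_back({e.name, e.attributes,
                       it != tag_by_id.end() ? it->second : 0u});
    }
    return out;
}

// Usage: one ordinary file and one symlink in the same directory.
inline void merge_reparse_tags_example() {
    std::vector<DirEntry> entries{{1, "a.txt", 0x20}, {2, "link", 0x410}};
    std::vector<ReparseRecord> records{{2, 0xA000000Cu}};
    auto merged = merge_reparse_tags(entries, records);
    assert(merged[0].reparse_tag == 0);
    assert(merged[1].reparse_tag == 0xA000000Cu);
}
```

If the reparse list is empty, the map stays empty and the second loop is
just a copy, which is the "close to zero overhead" case.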

(I haven't tested this, so I'm not sure if it gives you an empty tag for
files that aren't reparse points, or only lists reparse points. The
latter would be nice, as it would be close to zero overhead for
directories that do not contain reparse points.)

Presumably Win32 FindFirstFile is doing something like this internally,
since it does provide the reparse tag.

I'm not sure if it's current, but
http://blogs.technet.com/b/filecab/archive/2013/02/14/dfsr-reparse-point-support-or-avoiding-schr-246-dinger-s-file.aspx
seems to suggest the following behaviour as reasonable:

  - treating IO_REPARSE_TAG_MOUNT_POINT as directory symlinks
  - treating IO_REPARSE_TAG_SYMLINK as symlinks
  - treating IO_REPARSE_TAG_DEDUP, IO_REPARSE_TAG_SIS, and
    IO_REPARSE_TAG_HSM as regular files
  - treating any other tag as something to be ignored (in most cases)
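
Expressed as code, that mapping might look something like the sketch below.
EntryKind and classify are hypothetical names, and the tag values are copied
from winnt.h so the snippet stands alone.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical classification mirroring the list above.
enum class EntryKind { DirectorySymlink, Symlink, RegularFile, Ignored };

// Tag values as defined in winnt.h.
constexpr std::uint32_t kTagMountPoint = 0xA0000003;  // IO_REPARSE_TAG_MOUNT_POINT
constexpr std::uint32_t kTagSymlink    = 0xA000000C;  // IO_REPARSE_TAG_SYMLINK
constexpr std::uint32_t kTagDedup      = 0x80000013;  // IO_REPARSE_TAG_DEDUP
constexpr std::uint32_t kTagSis        = 0x80000007;  // IO_REPARSE_TAG_SIS
constexpr std::uint32_t kTagHsm        = 0xC0000004;  // IO_REPARSE_TAG_HSM

constexpr EntryKind classify(std::uint32_t tag) {
    switch (tag) {
        case kTagMountPoint: return EntryKind::DirectorySymlink;
        case kTagSymlink:    return EntryKind::Symlink;
        case kTagDedup:
        case kTagSis:
        case kTagHsm:        return EntryKind::RegularFile;
        default:             return EntryKind::Ignored;
    }
}

// Usage: a deduped file is reported as a regular file.
inline void classify_example() {
    assert(classify(kTagDedup) == EntryKind::RegularFile);
    assert(classify(kTagMountPoint) == EntryKind::DirectorySymlink);
}
```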

There was also a note that you can use IsReparseTagNameSurrogate to
determine if a given reparse point tag is a surrogate (some kind of
link) or not (treat like regular file). That might be the best option,
if it's consistent -- and at least for the official MS tags it seems to
be; MOUNT_POINT and SYMLINK are surrogates and the other types are not.
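
For reference, IsReparseTagNameSurrogate is just a macro in winnt.h that
tests bit 29 (0x20000000) of the tag. A standalone equivalent, with
is_name_surrogate being a made-up name and the tag values copied from
winnt.h, shows the split described above:

```cpp
#include <cassert>
#include <cstdint>

// Same test as the IsReparseTagNameSurrogate macro: bit 29 of the tag.
constexpr bool is_name_surrogate(std::uint32_t tag) {
    return (tag & 0x20000000u) != 0;
}

// Tag values as defined in winnt.h.
static_assert(is_name_surrogate(0xA0000003u), "IO_REPARSE_TAG_MOUNT_POINT");
static_assert(is_name_surrogate(0xA000000Cu), "IO_REPARSE_TAG_SYMLINK");
static_assert(!is_name_surrogate(0x80000013u), "IO_REPARSE_TAG_DEDUP");
static_assert(!is_name_surrogate(0x80000007u), "IO_REPARSE_TAG_SIS");
static_assert(!is_name_surrogate(0xC0000004u), "IO_REPARSE_TAG_HSM");

// Usage: treat surrogates as links and everything else as a regular file.
inline void is_name_surrogate_example() {
    assert(is_name_surrogate(0xA000000Cu));
    assert(!is_name_surrogate(0x80000013u));
}
```

Since this needs nothing but the tag, it costs nothing beyond obtaining the
tag in the first place.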

> I appreciate you're saying the cost is worth it, but we're thinking
> all Boost users here, not just the small minority on Windows Server
> 2012 with dedup turned on.

I'm not on Server 2012, but this thread caught my attention because I
remember encountering a bug that prevented all WinXP clients from
accessing deduped files on CIFS shares provided by Server 2012. I think
in the end this was a server-side bug related to McAfee and the
different protocols used by WinXP vs. Win7. If so, clients shouldn't
normally be able to see remotely whether files are deduped or not, but I
haven't explicitly verified that. If CIFS shares do expose files as
dedup reparse points instead of concealing that, then it might affect
quite a lot of users.

