Subject: Re: [boost] [filesystem] proposal: treat reparse files as regular files
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2015-07-29 06:59:47


On 29 Jul 2015 at 18:09, Gavin Lambert wrote:

> On 29/07/2015 14:06, Niall Douglas wrote:
> > NTFS compressed files act exactly like normal files. Reparse point
> > files do not and require significant additional processing to figure
> > out what kind they are. That's the difference.
> >
> > From AFIO's perspective, when it does NtQueryDirectoryFile() to fetch
> > metadata about a file entry, it can zero cost learn if an entry is a
> > reparse point by examining FileAttributes for the
> > FILE_ATTRIBUTE_REPARSE_POINT flag. It cannot tell what kind of
> > reparse point file it is without opening the file and asking.
> >
> > Windows' CreateFile() API is astonishingly slow. To require calling
> > that, then an additional NtQueryDirectoryFile() to fetch the
> > FILE_REPARSE_POINT_INFORMATION metadata and close the handle - which
> > is the fastest way I know of to fetch the reparse point tag code -
> > would impose an enormous performance penalty for all file entries
> > marked with FILE_ATTRIBUTE_REPARSE_POINT.
>
> If it helps,
> https://msdn.microsoft.com/en-us/library/windows/desktop/aa365511.aspx
> seems to specify that reparse points provide their tag id in the
> dwReserved0 field of the WIN32_FIND_DATA structure (I'm not sure how
> that maps to the native API, but I assume it's somewhere). That should
> be sufficient to identify the reparse point type.

That does help greatly, in fact. I know the FindFirstFile/FindNextFile
family doesn't open each file, so somehow the Win32 layer must be able
to fetch the reparse tag for directory entries purely from the
directory handle.
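
For illustration, a minimal sketch of what Gavin describes: read the
tag out of dwReserved0 while enumerating with FindFirstFileW. The
helper names (describe_find_data, list_reparse_entries,
find_data_self_check) are hypothetical, the constant values are copied
from the Windows SDK headers so that the classification helper
compiles anywhere, and the enumeration part is only built on Windows.
Untested against a real deduplicated volume.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Values as defined in the Windows SDK headers, repeated here so the
// classification helper compiles on any platform.
constexpr std::uint32_t kFileAttributeReparsePoint = 0x00000400u; // FILE_ATTRIBUTE_REPARSE_POINT
constexpr std::uint32_t kTagMountPoint = 0xA0000003u;             // IO_REPARSE_TAG_MOUNT_POINT
constexpr std::uint32_t kTagSymlink    = 0xA000000Cu;             // IO_REPARSE_TAG_SYMLINK
constexpr std::uint32_t kTagDedup      = 0x80000013u;             // IO_REPARSE_TAG_DEDUP

// Hypothetical helper: classify an entry from the two WIN32_FIND_DATA
// fields dwFileAttributes and dwReserved0.
inline const char* describe_find_data(std::uint32_t attributes,
                                      std::uint32_t reserved0) {
    // dwReserved0 is only meaningful when the reparse attribute is set.
    if ((attributes & kFileAttributeReparsePoint) == 0)
        return "not a reparse point";
    switch (reserved0) {
        case kTagMountPoint: return "mount point";
        case kTagSymlink:    return "symbolic link";
        case kTagDedup:      return "deduplicated file";
        default:             return "other reparse point";
    }
}

// Small self-check of the helper; no file system access involved.
inline void find_data_self_check() {
    assert(std::strcmp(describe_find_data(0x20u, 0u), "not a reparse point") == 0);
    assert(std::strcmp(describe_find_data(0x400u, kTagDedup), "deduplicated file") == 0);
}

#ifdef _WIN32
#include <windows.h>

// Enumerate entries matching a pattern such as L"C:\\some\\dir\\*" and
// print what kind of entry each one is. No per-file handle is opened.
inline void list_reparse_entries(const wchar_t* pattern) {
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileW(pattern, &fd);
    if (h == INVALID_HANDLE_VALUE) return;
    do {
        std::printf("%ls: %s\n", fd.cFileName,
                    describe_find_data(fd.dwFileAttributes, fd.dwReserved0));
    } while (FindNextFileW(h, &fd));
    FindClose(h);
}
#endif
```

The classification is kept separate from the enumeration so the same
helper could be fed from a native API record instead of
WIN32_FIND_DATA.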

> Granted, a single NtQueryDirectoryFile on the whole directory is not
> enough to get both sets of data, but you should still be able to do it
> in just two calls per directory (times however many calls are required
> to fully enumerate the directory, of course).
>
> Presumably you're currently using one of FileBothDirectoryInformation or
> FileFullDirectoryInformation. You should be able to switch to the "Id"
> variants (FileIdBothDirectoryInformation or
> FileIdFullDirectoryInformation) instead (if you're not already using
> them). This gives you a FileId for each file, along with the other
> information.
>
> After you've enumerated the entire directory, you can go back and get
> FileReparsePointInformation for the whole directory, and then match up
> the FileId against the FileReference to merge the data and get the
> reparse tag for each file.
>
> (I haven't tested this, so I'm not sure if it gives you an empty tag for
> files that aren't reparse points, or only lists reparse points. The
> latter would be nice, as it would be close to zero overhead for
> directories that do not contain reparse points.)

Unfortunately, querying FileReparsePointInformation returns just a
single record, which is the reparse point for the directory handle
being enumerated. It doesn't return reparse tags for the directory's
contents.

There is an index of all reparse points on an NTFS volume in a magic
NTFS file stream, but that's NTFS-specific code, and it requires a
file handle to be opened.

I'm thinking that as reparse points are really just an overload on
EA, maybe the returned EaSize field is magically set to the reparse
tag when attributes specify it's a reparse point file? I'd have to
experiment to find out. I can't see any other obvious field which
would return the reparse tag.

EDIT: What a guess that was! See
https://www.osronline.com/showthread.cfm?link=171655
Thanks Gavin, you just solved the problem, for AFIO at least.
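
A minimal sketch of the decoding this implies, assuming the behaviour
described in the linked thread: when FileAttributes carries
FILE_ATTRIBUTE_REPARSE_POINT, the EaSize field of the directory record
holds the reparse tag rather than an EA size. DirEntryModel is a
hypothetical, trimmed-down stand-in for a native directory record, and
the function names are invented for illustration. This is not AFIO's
actual code.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kReparsePointAttr = 0x00000400u; // FILE_ATTRIBUTE_REPARSE_POINT
constexpr std::uint32_t kDedupTag         = 0x80000013u; // IO_REPARSE_TAG_DEDUP

// Hypothetical model of the two fields of interest in a record returned
// by NtQueryDirectoryFile(); the real structures have many more members.
struct DirEntryModel {
    std::uint32_t FileAttributes;
    std::uint32_t EaSize;
};

// Returns true and stores the tag if the entry is flagged as a reparse
// point. ASSUMPTION: EaSize carries the reparse tag in that case.
inline bool reparse_tag_from_entry(const DirEntryModel& e,
                                   std::uint32_t& tag_out) {
    if ((e.FileAttributes & kReparsePointAttr) == 0)
        return false;          // EaSize really is an EA size here
    tag_out = e.EaSize;
    return true;
}

// Convenience check for the dedup case discussed in this thread.
inline bool is_dedup_entry(const DirEntryModel& e) {
    std::uint32_t tag = 0;
    return reparse_tag_from_entry(e, tag) && tag == kDedupTag;
}

// Small self-check on synthetic records; no file system access involved.
inline void ea_size_self_check() {
    DirEntryModel plain{0x20u, 128u};        // ordinary file with 128 bytes of EA
    DirEntryModel dedup{0x400u, kDedupTag};  // flagged entry carrying the dedup tag
    assert(!is_dedup_entry(plain));
    assert(is_dedup_entry(dedup));
}
```

If the assumption holds, the tag comes for free with the directory
enumeration, so no CreateFile() call per flagged entry is needed.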

> > I appreciate you're saying the cost is worth it, but we're thinking
> > all Boost users here, not just the small minority on Windows Server
> > 2012 with dedup turned on.
>
> I'm not on Server 2012, but this thread caught my attention because I
> remember encountering a bug that prevented all WinXP clients from
> accessing deduped files on CIFS shares provided by Server 2012. I think
> in the end this was a server-side bug related to McAfee and the
> different protocols used by WinXP vs. Win7, and so clients shouldn't
> normally be able to see whether files are deduped or not remotely, but I
> haven't explicitly verified that. If CIFS shares do expose files as
> dedup reparse points instead of concealing that then it might affect
> quite a lot of users.

I had understood from the OP that CIFS is exporting the reparse point
tag to clients, hence the breakage.

The reason, I suspect, that CIFS is being so braindead here is that
opening a deduped file is more expensive than usual and clients ought
to know. Which is exactly why I am opposed to treating these things
as regular files.

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ 
http://ie.linkedin.com/in/nialldouglas/


