Subject: Re: [boost] [filesystem] proposal: treat reparse files as regular files
From: Paul Harris (harris.pc_at_[hidden])
Date: 2015-07-26 22:55:35


On 25 July 2015 at 00:56, Niall Douglas <s_sourceforge_at_[hidden]> wrote:

> On 24 Jul 2015 at 16:03, Paul Harris wrote:
>
> > tl;dr : I propose that we treat all non-symlink "reparse_files" as
> > "regular_files".
> >
> > If the boost library user wants to do something special with these plain
> > reparse files, they should use alternative means. But typically they are
> > supposed to be treated as regular files.
> >
> > This means we could drop the "reparse_file" enum, or continue to use it for
> > a special-case whats_my_real_status() function.
> >
> >
> > --- Motivation ---
> >
> > Windows Server 2012 uses reparse points to implement deduplication.
> > Those files should be treated as regular files in all circumstances.
> > Currently, they are not classed as "regular" files, so fs::copy() will skip
> > those files, and library-user code written to list files based on official
> > examples will ignore all dedup'd files.
> >
> > This is causing serious and latent problems at the user end, because
> > deduping only happens occasionally after X days, and users cannot easily
> > check if a file is dedup'd (they look just like regular files).
> >
> >
> > --- Real life example ---
> >
> > Another example of reparse use is the "Symantec Enterprise Vault" (version
> > 10), which I found running on one site.
> > It replaces files on the server with reparse-point files.
> > FSUTIL REPARSEPOINT QUERY filename.txt
> > shows the contents of the reparse buffer, which is a URL to an internal
> > HTTP server. The URL points to a .asp link with a bunch of codes and dates
> > to identify the file in the server.
> > Copy-pasting that URL into a web browser allows you to directly download the
> > file via the browser, which is pretty neat I suppose.
> >
> > In this case, the reparsed files in Windows Explorer all have grey X
> > crosses on their file icon. If you "type" them (via cmd) or open them, the
> > icon loses the grey cross and the file is no longer a reparse point file.
> >
> > My software refused to read the files because they were "not regular
> > files". Once I adjusted the boost code (described below), my software saw
> > them as regular and opened the files. The file icons lost the grey cross.
> >
> > So it seems that the file server automatically downloads and replaces the
> > files with the stored content on demand, and the file reading client
> > program should really just treat these files as normal files.
> >
> >
> > --- Short logic ---
> >
> > Reparse files (that are not symlinks) should almost always be treated as
> > plain files.
> > They are a mechanism for MS file servers to store files in clever ways, but
> > the client should not care and just read/write them as if they were normal
> > files.
> >
> > This is different to all the other "other" files which can't be treated
> > like normal files:
> > block, character, fifo, socket, unknown
> >
> > So, reparse files should not be grouped with the "other" file types.
> >
> > They are also NOT symlinks, and should not be treated as symlinks (which
> > would require special decisions for copying, or querying the status, or
> > checking if the target still exists).
> >
> >
> > --- What are reparse files ---
> >
> > I did some reading, if I understand correctly:
> >
> > Reparse points give drivers (on the server) a chance to get data through
> > some other specialised means (eg query from a cluster store).
> > They are processed by the server, not the client, so clients should treat
> > reparse data as opaque data.
> > EXCEPT for symlink reparse files.
> >
> > https://msdn.microsoft.com/en-us/library/dd541667.aspx
> >
> > quote:"The following reparse tags, with the exception of
> > IO_REPARSE_TAG_SYMLINK, are processed on the server and are not processed
> > by a client after transmission over the wire. Clients should treat
> > associated reparse data as opaque data."
> >
> > It seems like the rest of the tags are used for connecting files to other
> > types of storage (eg long term storage, cluster storage).
> > Clients may need to do something special with SOME reparse point files, IF
> > the client cares about how long the file read may take.
> >
> > https://msdn.microsoft.com/en-us/library/windows/desktop/aa365505(v=vs.85).aspx
> > quote: "Most applications should take special actions for files that have
> > been moved to long-term storage, if only to notify the user that it may
> > take a while to retrieve the file."
> >
> >
> > --- Changes required ---
> >
> > Option 1: change is_regular_file() to return true where type==reparse_file
> > I don't like this option, as library-users could be checking the type
> > directly instead of using is_regular_file().
> >
> >
> > Option 2:
> > These functions return reparse_file:
> >
> > fs::file_type query_file_type(const path& p, error_code* ec)
> > file_status status(const path& p, error_code* ec)
> > file_status symlink_status(const path& p, error_code* ec)
> >
> > They should return regular_file instead.
> >
> >
> > --- How to test with dedup files ---
> >
> > Creating dedup'd files is a feature only available on Windows Server 2012,
> > I believe, although Windows XP/Vista/7/8/10 clients can all read dedup files.
> >
> > Here is how I created a windows server to test with (for free!) on a demo
> > Azure cloud server.
> > I have one working, so if anyone would like to use it for their testing,
> > let me know.
> >
> > Step one: follow this blog article:
> > http://blogs.technet.com/b/tommypatterson/p/azureservertrial.aspx
> >
> > once the machine was "running" I clicked Connect at the bottom.
> > That gave me an .rdp file which in theory I could use with rdesktop, but it
> > uses a DNS name that was only just created, so that didn't work.
> >
> > When you click the name of the server in the list, it shows the public IP
> > on the right.. and the port
> > then you can do this
> > $ rdesktop that.ip.addr:port
> >
> > But only if you have the latest rdesktop AND you have set up kerberos
> > something-something.
> >
> > Instead I found a windows computer and used remote desktop from there.
> >
> >
> > ---
> >
> > Once inside,
> > in the "Server Manager --> Dashboard" window on the screen, click "Add
> > Roles"
> > then go next next until "Server Roles"
> > expand "File and Storage services" , "File and iSCSI" , and tick "Data
> > Deduplication"
> > Then next next etc and Install.
> > Wait a bit... and it's done.
> >
> > http://www.techrepublic.com/blog/data-center/configuring-windows-server-8-deduplication/
> >
> > ---
> >
> > Continuing on that webpage...
> > Time to enable dedup. There is a temp disk D: so let's enable there.
> >
> > Method 1... I did this and then went to method 2... Start PowerShell, type:
> > "Enable-DedupVolume D:"
> >
> > Method 2... in that same Dashboard, hit the 4th button (File and Storage
> > Services)
> > Then Volumes --> Disks
> > click Volume 1 at the top, and then right click D: at the bottom -->
> > Configure Dedup.
> >
> > To try and accelerate this puppy, I set the "age to dedup" to 0 days.
> >
> >
> > http://www.techrepublic.com/blog/data-center/windows-server-2012-deduplication-how-and-where-to-tweak/
> >
> > ---
> >
> > Time to make something to dedup. We'll just duplicate the warning.txt file
> > that exists on D:
> >
> > In powershell:
> > PS> D:
> > PS> $file = Get-Content DATALOSS_WARNING_README.txt
> >
> > Then, do these 2 commands a bunch of times until "big.txt" gets to say 6MB
> > PS> Add-Content big.txt $file
> > PS> $file = Get-Content big.txt
> >
> > Then use windows explorer (or other) to make a dozen copies of big.txt
> >
> >
> > Copy c:\windows\explorer.exe to D:
> > to give it something to dedup
> > Go to D: and then copy-paste explorer.exe a dozen times.
> >
> > In PowerShell, type:
> > PS> Update-DedupStatus -Volume D:
> > PS> Start-DedupJob -Type Optimization -Volume D:
> >
> > and then wait for it to finish.
> > you can track its progress with:
> > PS> Get-DedupJob
> > PS> Get-DedupStatus -Volume D:
> >
> > ---
> >
> > So, once it's deduped, you check.
> > PS> FSUTIL REPARSEPOINT QUERY big.txt
> > you should see that it's a reparse point with that 0x800etc0013 code.
> >
> > Copy-paste big.txt to big2.txt and check it with the query, and it should
> > tell you big2 is NOT a reparse point.
> >
> >
> > NOW you have some files to test the boost library...
> > You can't zip them up (they lose the dedup tag), you have to run boost
> > binaries ON the computer in the sky.
> >
> >
> > --- Finish ---
> >
> > Thanks for reading,
> > Paul
>
> I appreciate all the detail, and I'm sure so does Beman who is
> Filesystem's maintainer.
>
> However, they all still look like symlinks to me. Just because the OS
> magically replaces them with the real file on first access is
> immaterial - the same thing could happen on Linux. If you don't treat
> them as symlinks, there is no way of inspecting the link without
> causing it to be auto-downloaded which could be catastrophic in some
> use cases.
>
> I still vote for pseudo-symlinks to be reported by Filesystem as
> symlinks.
>

I did think about that, but the design of these reparse points intends for
these files to be treated as plain files by the client - as per MS
documents.
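
To make that concrete, here is a minimal sketch of the mapping I have in
mind. It is not the actual Boost.Filesystem code: classify() and file_kind
are hypothetical names, and the constants are copied from the Windows SDK
headers so that the sketch compiles on any platform. Only
IO_REPARSE_TAG_SYMLINK is reported as a symlink; every other reparse tag
(dedup is shown as one example) falls through to the ordinary
file/directory decision.

```cpp
#include <cstdint>

// Values as defined in the Windows SDK headers, repeated here so the
// sketch does not need <windows.h>.
constexpr std::uint32_t kFileAttributeDirectory    = 0x00000010; // FILE_ATTRIBUTE_DIRECTORY
constexpr std::uint32_t kFileAttributeReparsePoint = 0x00000400; // FILE_ATTRIBUTE_REPARSE_POINT
constexpr std::uint32_t kReparseTagSymlink         = 0xA000000C; // IO_REPARSE_TAG_SYMLINK
constexpr std::uint32_t kReparseTagDedup           = 0x80000013; // IO_REPARSE_TAG_DEDUP

// Hypothetical result type; the real library has more file types than this.
enum class file_kind { regular, directory, symlink };

// Hypothetical helper illustrating the proposed policy.
// 'attributes' is the file's attribute mask; 'reparse_tag' is only
// meaningful when the reparse-point attribute bit is set.
file_kind classify(std::uint32_t attributes, std::uint32_t reparse_tag)
{
    const bool is_reparse = (attributes & kFileAttributeReparsePoint) != 0;

    // Real symlinks keep their special treatment.
    if (is_reparse && reparse_tag == kReparseTagSymlink)
        return file_kind::symlink;

    // Any other reparse tag is ignored: the server resolves it, so the
    // client just sees a directory or a regular file.
    if ((attributes & kFileAttributeDirectory) != 0)
        return file_kind::directory;
    return file_kind::regular;
}
```

With this, classify(kFileAttributeReparsePoint, kReparseTagDedup) gives
file_kind::regular, while the same attributes with kReparseTagSymlink give
file_kind::symlink.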

Plus, as I understand it, the reparse buffer is entirely driver-specific,
so you can't expect boost or any user program to be able to decode what
is inside the reparse buffer and do anything intelligent. AND the
resolving is done by the driver on the server side. Note that there are
probably a dozen products out there that use these reparse buffers for
their storage solution... it's not just Windows dedup.

So, I don't see how the client can do anything intelligent with symlink
knowledge, AND if boost library users are forced to treat them as symlinks,
then you now have 2 kinds of symlinks:

1) standard symlink, of which you sometimes really want a shallow copy, and
where you have to be careful of loops ( A -> B -> A )

2) reparse (but not symlink), which you cannot shallow-copy (as far as I
understand), and loops are not possible.

So I've already seen:

* My software doesn't want to follow links, but now the new version will
force me to specifically check if it's just a reparse file and then follow.

* Whole-disk backup software doesn't follow symlinks because it assumes
it'll get the real file later. Reparse (non-symlink) files do not have
any other "real file", so those files are not being backed up at all right
now.

So treating them as symlinks causes more trouble than it saves by helping
the one edge case.

Non-symlink reparse files are such a specialised case that I'd personally
want a specialised get_reparse_info kind of function, so that if I really
need to care, I can find that information.
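
As a rough idea of what such a function could hand back, here is a small
sketch. reparse_info and describe_reparse_tag() are hypothetical names, not
anything in Boost.Filesystem. It assumes the caller has already obtained the
32-bit reparse tag (on Windows, for example, from the dwReserved0 field of
WIN32_FIND_DATA when the reparse-point attribute is set) and only decodes
the two flag bits that the Windows SDK macros IsReparseTagMicrosoft and
IsReparseTagNameSurrogate test. The rest of the reparse buffer stays opaque.

```cpp
#include <cstdint>

// Hypothetical summary of a reparse point, derived from its tag alone.
struct reparse_info {
    std::uint32_t tag;      // raw reparse tag, e.g. 0xA000000C for a symlink
    bool microsoft_owned;   // high bit set: tag is owned by Microsoft
    bool name_surrogate;    // bit 29 set: entry stands in for another named entity
};

// Decode the flag bits of a reparse tag. The masks mirror the SDK macros
// IsReparseTagMicrosoft (0x80000000) and IsReparseTagNameSurrogate (0x20000000).
reparse_info describe_reparse_tag(std::uint32_t tag)
{
    reparse_info info;
    info.tag = tag;
    info.microsoft_owned = (tag & 0x80000000u) != 0;
    info.name_surrogate  = (tag & 0x20000000u) != 0;
    return info;
}
```

For IO_REPARSE_TAG_SYMLINK (0xA000000C) the name-surrogate flag is set; for
the dedup tag (0x80000013) it is not, which matches the distinction between
the two kinds of reparse file discussed above.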

Your thoughts?
Cheers,
Paul

