Boost logo

Boost-Build :

Subject: Re: [Boost-build] [win32] Inefficiencies in builtin_check_if_file
From: Steven Watanabe (watanabesj_at_[hidden])
Date: 2012-04-24 16:22:52


AMDG

On 04/24/2012 11:01 AM, François Jacques wrote:
> Hello,
>
> The windows implementation of file_query requires excessive time, and
> is mostly I/O bound. I've been able to reduce the amount of time by
> replacing the stat call in filent.c::file_query by
> GetFileAttributesExA. The later is significantly faster than stat
> since it doesn't attempt to expand the path it is being given. On a
> large project, this made a big difference.
>
> But that's not all, and the next issue is more interesting to fix.
> Follow the white rabbit :
>
> First, stat/GetFileAttributesExA is called to update bjam's
> file_info_t record that gets created whenever file_info is called with
> a path that's not already in the filesys.c file_info_t hashtable :
>
> file_info_t * ff = file_info( filename );
> if ( ! ff->time )
> { /* call stat or GetFileAttriubutes */ }
>
> Then, in file_info function, bjam makes sure that the key is sane by
> calling path_as_key :
>
> file_info_t * file_info( OBJECT * filename ) {
> /* snip */
> filename = path_as_key( filename );
> /* snip */
> }
>
> On Win32, path_as_key calls FindFirstFileA through ShortPathToLongPath
> and its use is questionable.
>
> Again, but that's not the point here, GetFileAttributes could have
> been used instead as is likely to give a slightly better performance.
> Anyhow. What bugs me here is the WIN32_FIND_DATA structure in there
> that will end up containing all the information to populate the
> file_info structure... and is thrown away! Instead, the file_info gets
> updated through stat and GetFileAttributes in upper levels. Why? Why
> hitting the file system twice for each file?
>
> Taking a step back, is ShortPathToLongPath required at all? All other
> platforms are not using that mechanism. Is it to ensure that short
> paths ("progra~1") gets expanded to "Program Files" ? The name of the
> function hints in that direction. I can see how it could be used to
> avoid having two cache entries for a given path that's expressed with
> short path notation to save memory, but the performance hit from I/O
> is terrible on large projects.
>
> Perhaps path_as_key and ShortPathToLongPath should receive an output
> file_info argument that would be populated with the information
> retrieved by the FindFirstFile call? In such scenario, the stat call
> in file_query would become redundant on Windows but since the
> file_info structure is already populated, the stat call would not be
> performed.
>
> Please have a look at the codebase and correct me if I'm wrong. The
> profiler says I'm right though :)
>

This all makes perfect sense. Patches welcome,
otherwise I'll implement it some time after the
next release.

In Christ,
Steven Watanabe


Boost-Build list run by bdawes at acm.org, david.abrahams at rcn.com, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk