From: Beman Dawes (bdawes_at_[hidden])
Date: 2005-06-09 10:11:49


At 02:31 PM 6/8/2005, Chris Frey wrote:

>On Thu, Jun 02, 2005 at 08:16:59AM -0400, Beman Dawes wrote:
>> * That is a lot of additional interface complexity to support an
>> optimization that applies to Windows but not POSIX. Some of the other
>> schemes (which involved additional overloads to specific operations
>> functions) had less visible impact on the interface.
>
>Windows is not the only system with d_type. I was writing with Linux
>in mind.

Sorry. I usually just look at the POSIX docs when I want to know what
Unix-style operating system functions are available, and forget that
operating-system-specific extensions also exist.

>> * There have been no timings to indicate the inefficiency of the
>> current interface actually impacts production applications.
>
>I'm not sure this is the right view to take when it comes to something
>as low level and highly used as a filesystem.

I should have put that in a broader context. The first priority is getting
the interface functionality, safety, and ease-of-use right. I've been tied
up with those issues as related to the i18n branch for a long time now, but
as of yesterday finished that work to my satisfaction. It will take a
mini-review to see if that work satisfies others, of course.

Once functionality, safety, and ease-of-use concerns are addressed, then
efficiency becomes more important. But because efficiency is such a
notoriously slippery issue, the first step has got to be timings in
realistic use scenarios.

>... I decided to run a test, comparing directory_iterator
>with the find system utility...

Your tests clearly raise directory_iterator performance concerns.

Terence Wilson reported directory_iterator performance concerns too, and
provided a timing program as well.

The concern I had with both of these timing tests was that while not
totally apples-and-oranges comparisons, there were still a lot of
uncontrolled variables.

I put together a timing test program (see below) that depends entirely on
Boost.Filesystem operations. Since the only difference between the two
modes of operation is the use of boost::filesystem::status(), any timing
differences are caused by that alone.
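
A minimal sketch of such a two-mode comparison (not the program referred
to above). It assumes C++17 std::filesystem as a stand-in for
Boost.Filesystem; ScanResult and scan() are hypothetical names introduced
only for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical result record for one pass over a directory.
struct ScanResult {
    std::size_t entries = 0;      // directory entries visited
    std::size_t directories = 0;  // counted only when query_status is true
    double seconds = 0.0;         // wall-clock time for the pass
};

// Walk one directory (non-recursively). When query_status is true, ask the
// filesystem for each entry's status by path; otherwise only step the
// iterator. The status query is the only difference between the two modes.
ScanResult scan(const fs::path& dir, bool query_status) {
    ScanResult r;
    const auto start = std::chrono::steady_clock::now();
    for (fs::directory_iterator it(dir), end; it != end; ++it) {
        ++r.entries;
        if (query_status) {
            std::error_code ec;
            // Query by path, so nothing remembered by the iterator is reused.
            const fs::file_status st = fs::status(it->path(), ec);
            if (!ec && fs::is_directory(st)) ++r.directories;
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    r.seconds = std::chrono::duration<double>(stop - start).count();
    return r;
}
```

Calling scan(dir, false) and scan(dir, true) on the same directory and
comparing the seconds fields gives the kind of comparison described here.
Cold-cache and warm-cache runs need to be kept apart when reading the
numbers.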

The timing differences between the two modes are dramatic. With Windows XP
SP 2, 1 gigabyte main memory, compiled with VC++ 7.1 in release mode, in an
NTFS directory with 15,046 files, run from a freshly booted machine,
average of three runs:

      6.06 seconds with status()
      1.04 seconds without status()

Additional runs (showing no disk activity whatsoever because of disk
caching):

      1.03 seconds with status()
      0.31 seconds without status()

The timing differences are explained by watching file activity (using the
Diskmon utility from http://www.sysinternals.com/).

Without status() enabled, there is 1 DIRECTORY action every 34 or so files,
and 1 READ action every 17 or so files.

With status() enabled, there is 1 DIRECTORY action every 34 or so files,
and 1 INFORMATION QUERY action for _every_ file. Each of those actually
causes disk activity too, judging by the disk light (but only if the cache
is cold).

In other words, use of status() is causing roughly 17 times as much disk
activity.

So it looks like a status() overload which takes an iterator directly (with
the one-byte status value cached in the iterator) would be a big
performance plus.
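
A rough sketch of that idea, not a proposed interface: cached_type, entry,
and is_directory_cached() are hypothetical names, and C++17
std::filesystem stands in for Boost.Filesystem. The entry carries a
one-byte type tag that a directory scan could fill in when the operating
system supplies it (for example, the d_type field mentioned above). The
per-file query happens only when the tag is unknown.

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical one-byte tag remembered alongside each directory entry.
enum class cached_type : unsigned char { unknown, regular, directory, other };
static_assert(sizeof(cached_type) == 1, "the cached tag occupies one byte");

// Hypothetical entry as an iterator might expose it.
struct entry {
    fs::path path;
    cached_type type = cached_type::unknown;  // filled by the scan if known
};

// Answer from the cached tag when there is one; otherwise fall back to a
// per-file query, which is the step that costs extra disk activity.
bool is_directory_cached(const entry& e) {
    if (e.type != cached_type::unknown) {
        return e.type == cached_type::directory;  // no filesystem access
    }
    std::error_code ec;
    return fs::is_directory(e.path, ec);  // false if the query fails
}
```

With the tag set, the answer comes from memory alone, so the path does not
even need to exist for the call to return. That is the saving the overload
is after.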

Thanks for going to the trouble of doing timings. They were very
motivating!

--Beman
  

