From: Caleb Epstein (caleb.epstein_at_[hidden])
Date: 2006-04-20 15:21:47


Beman, is this on your radar screen at all? I last tried to ping you about
it in March.

To refresh: Boost.Filesystem appears to call POSIX pathconf every time a
directory iterator is initialized, which hurts performance in recursive
operations such as a find(1)-style directory walk.
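
For illustration only, here is a minimal sketch of the kind of change
suggested in the quoted message below: prefer a compile-time constant where
the platform defines one, and fall back to pathconf otherwise. The names
max_name_length and dirent_buffer_size are hypothetical and are not part of
Boost.Filesystem, and the fallback value of 255 is an assumption. Note that
NAME_MAX is a single system-wide value and may not match the limit of every
mounted file system, which is presumably why pathconf was chosen in the
first place.

```cpp
#include <cassert>
#include <climits>    // NAME_MAX, on platforms that define it
#include <cstddef>    // std::size_t, offsetof
#include <dirent.h>   // struct dirent (and MAXNAMLEN on some platforms)
#include <unistd.h>   // pathconf, _PC_NAME_MAX

// Hypothetical helper (not a Boost.Filesystem function): longest file
// name to expect for entries of directory `dir`.
inline std::size_t max_name_length(const char* dir)
{
    assert(dir != nullptr);
#if defined(NAME_MAX)
    (void)dir;
    return NAME_MAX;       // compile-time constant, no system call
#elif defined(MAXNAMLEN)
    (void)dir;
    return MAXNAMLEN;      // BSD-style spelling of the same limit
#else
    // Portable but slow path: a library call per directory.
    long n = ::pathconf(dir, _PC_NAME_MAX);
    // 255 is an assumed fallback when pathconf reports no limit or fails.
    return n > 0 ? static_cast<std::size_t>(n) : 255;
#endif
}

// Hypothetical helper: size of a buffer able to hold one directory entry
// with a name of maximum length plus the terminating NUL.
inline std::size_t dirent_buffer_size(const char* dir)
{
    return offsetof(struct dirent, d_name) + max_name_length(dir) + 1;
}
```

With something like this, a caller would allocate dirent_buffer_size(path)
bytes once per directory and would only reach pathconf on platforms that
define neither constant.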

On 3/7/06, Caleb Epstein <caleb.epstein_at_[hidden]> wrote:
>
> On 3/6/06, Caleb Epstein <caleb.epstein_at_[hidden]> wrote:
>
> > On 3/4/06, Caleb Epstein <caleb.epstein_at_[hidden]> wrote:
> >
> > > On 3/4/06, Ion Gaztañaga <igaztanaga_at_[hidden]> wrote:
> > >
> > > > Now that filesystem is proposed for the standard I would like to ask
> > > > boosters (and Beman, of course) if they find these performance
> > > > concerns serious enough.
> > >
> > >
> > > Perhaps if they were accompanied by some comparative performance
> > > benchmarks or profile analysis?
> > >
> >
> > In the interests of science, I wrote a small "file finder" using
> > Boost.Filesystem and a comparable version using POSIX functions (e.g.
> > stat, readdir). The POSIX version runs MANY times faster than the
> > Boost.Filesystem version (code attached). Note that the Boost version
> > makes use of the "status" member of the directory iterator which is in CVS
> > and is aimed at reducing the number of operating system calls that the
> > library needs to make.
> >
>
> It seems that this profiling output might be slightly misleading. I don't
> think it is taking into account the time spent in system calls.
>
> Using the Google CPU Profiler, I reran my tests on the Boost.Filesystem
> version of the file finder and found that the bulk of the runtime is spent
> not in string manipulation, but in calls to "statfs". It appears that this
> is coming from basic_directory_iterator::m_init, where it is calling
> pathconf:
>
> #0 0x4020d9c0 in statfs () from /lib/tls/libc.so.6
> #1 0x401dfb72 in pathconf () from /lib/tls/libc.so.6
> #2 0x40022f89 in boost::filesystem::detail::dir_itr_first
>     (handle=@0x8051184, buffer=@0x8051188, dir=@0xbfd8f79c,
>     target=@0xbfd8f798) at libs/filesystem/src/operations.cpp:1178
> #3 0x0804d231 in boost::filesystem::basic_directory_iterator<
>     boost::filesystem::basic_path<std::string,
>     boost::filesystem::path_traits> >::m_init
>     (this=0xbfd8f85c, dir_path=@0x80510d0) at operations.hpp:881
> #4 0x0804d903 in basic_directory_iterator (this=0xbfd8f85c,
>     dir_path=@0x80510d0) at operations.hpp:911
> #5 0x0804ad72 in finder (root=@0xbfd8f57c, stats=@0xbfd8f8e0,
>     recursive=true) at finder-fs.cpp:39
> #6 0x0804b07d in main (argc=6, argv=0xbfd8f9b4) at finder-fs.cpp:90
>
> This call accounts for 75% of the program's runtime according to Google's
> profiler (see attached PDF).
>
> If I change the code in dir_itr_first to use the constant value NAME_MAX
> instead of retrieving this value via pathconf, the runtimes for the
> Boost.Filesystem test and my "pure POSIX" version are nearly identical.
> Both run in about 0.8 seconds once the buffer cache has been seeded.
>
> I understand why pathconf is being used for portability's sake, but it
> seems like it's a real performance killer. Perhaps the NAME_MAX or MAXNAMLEN
> value could be used on platforms where it is defined?
>
> --
> Caleb Epstein
> caleb dot epstein at gmail dot com
>
>

--
Caleb Epstein
caleb dot epstein at gmail dot com
