Subject: Re: [boost] BOOST_FOREACH slow?
From: Bjørn Roald (bjorn_at_[hidden])
Date: 2008-11-19 16:50:32


Peter Bartlett wrote:
> Quoting bjorn_at_[hidden]:
>
>> Ok, do anybody know of other compilers/libs that implement loop
>> unrolling
>> in std::for_each?
>
> Are you looking for optimizations specifically for std::for_each other
> than those done by the compiler for general for loops?

Neither, or both; I am interested in what effect loop unrolling has on
performance in general, and in any straightforward, practical means of
achieving that effect, provided it is worth the while.
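
Just to be concrete about what I mean by unrolling: a plain hand-unrolled
loop along these lines (the function and names are my own illustration,
not from the benchmark):

#include <cstddef>
#include <vector>

// Hand-unrolled summation: handle four elements per iteration with
// independent partial sums, then mop up the remainder. A compiler may
// well do something equivalent on its own at -O3, which is exactly the
// effect I would like to measure.
double sum_unrolled(const std::vector<double>& v)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    const std::size_t n = v.size();
    for (; i + 4 <= n; i += 4)
    {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i)   // leftover elements
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}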

> If so, possibly of interest is the fact that new in GCC 4.3 is the
> (experimental) ability to call parallel versions of certain algorithms
> in <algorithm> and <numeric> via OpenMP:
>
> http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html

I think this is for completely different use cases than those of the
benchmark in question. Concurrency in algorithms is certainly of
interest, but I would intuitively reach for it only when the body of the
concurrent part does a significant amount of work. Loop unrolling is for
the case where the body of the loop is tiny. I would be surprised if
OpenMP-based concurrency were not less efficient in a benchmark like
this as far as wall-clock time goes, and in CPU time it certainly will
be. With optimal partitioning of the loop iterations it is possible that
an OpenMP version would win the wall-clock battle, but that is something
I doubt a library can achieve. I may be wrong.
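
By partitioning I mean roughly what an explicit OpenMP reduction does:
split the iteration range across threads and combine per-thread partial
sums at the end. A sketch of my own (not the benchmark code), and only
worthwhile for very large n because of thread start-up and
synchronisation overhead:

#include <vector>

// compile with -fopenmp
double sum_openmp(const std::vector<double>& v)
{
    double total = 0.0;
    const long n = static_cast<long>(v.size());
    // Each thread gets a chunk of the iterations and its own private
    // copy of 'total'; the copies are added together at the end.
#pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i)
        total += v[i];
    return total;
}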

Also, it is not clear to me how concurrency in the std algorithms will
affect the correctness of my programs.
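
To make that worry concrete: a stateful functor like the benchmark's
sum_func accumulates into a member, and I do not know whether a parallel
for_each calls one shared copy from several threads (a data race on the
+=) or gives each thread its own copy (so the returned object holds only
a partial sum). A reduced example, in my own words, of the pattern I am
unsure about:

#include <vector>
#include <parallel/algorithm>   // __gnu_parallel::for_each, needs -fopenmp

// Accumulates into a data member, like sum_func in the benchmark.
struct accumulating_func
{
    double sum;
    accumulating_func() : sum(0.0) {}
    void operator()(double x) { sum += x; }
};

double sum_via_parallel_for_each(const std::vector<double>& v)
{
    accumulating_func f = __gnu_parallel::for_each(v.begin(), v.end(),
                                                   accumulating_func());
    return f.sum;   // may or may not be the whole sum if run in parallel
}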

> This could put some distance between BOOST_FOREACH and std::for_each
> for certain use cases though I haven't experimented myself.

Ok, I gave it a quick shot for the fun of it, but the results only
confuse me. It looks to me as if the OpenMP code and flags have no
effect on std::for_each, nor do explicit calls to
__gnu_parallel::for_each give different results. It looks as if it is
falling back to the serialized version. I have not spent any time
figuring out why.
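
One thing I have not yet ruled out is that OpenMP simply is not giving me
more than one thread, in which case the parallel algorithms have nothing
to work with. A quick stand-alone check of my own (compile with
g++ -fopenmp):

#include <cstdio>
#include <omp.h>

int main()
{
    // If this prints 1, the parallel algorithms will effectively run serially.
    std::printf("max threads: %d\n", omp_get_max_threads());

#pragma omp parallel
    {
#pragma omp master
        std::printf("threads in parallel region: %d\n", omp_get_num_threads());
    }
    return 0;
}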

I added:

#include <omp.h>   // for the omp_* calls below (if not already included)

int main()
{
    // don't know if these are needed, I think there may be sensible defaults
    const int threads_wanted = 4; // I have a 4 core CPU
    omp_set_dynamic(false);
    omp_set_num_threads(threads_wanted);

    ....

#ifdef _GLIBCXX_PARALLEL
    {
        boost::timer timer;
        sum_func sum = __gnu_parallel::for_each( data.begin(), data.end(),
                                                 sum_func() );
        double elapsedtime = timer.elapsed();
        std::cout << "__gnu_parallel::for_each took "
                  << elapsedtime << " seconds."
                  << std::endl;
        force_calculation(sum.sum);
    }
#endif

        
to the benchmark, and compiled and executed like this:

g++ -D_GLIBCXX_PARALLEL -I../../boost_1_37_0 -march=native -fopenmp -O3 foreach_benchmark.cpp -o foreach_benchmark && ./foreach_benchmark
Iterator accumulate took 0.09 seconds.
Pointer accumulate took 0.09 seconds.
Iterator for loop took 0.09 seconds.
Pointer for loop took 0.09 seconds.
Index for loop took 0.09 seconds.
Index for_each took 0.09 seconds.
Pointer for_each took 0.09 seconds.
BOOST_FOREACH took 0.09 seconds.
std::for_each took 0.1 seconds.
__gnu_parallel::for_each took 0.08 seconds.

Nothing exciting. Can anyone tell me why?

I guess that boost::timer may not be appropriate for measuring this if we
do in fact run in parallel threads, since as far as I can tell it is based
on std::clock() and so reports CPU time (summed over all threads) rather
than wall-clock time.
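
A wall-clock measurement, for example with omp_get_wtime(), would be the
fairer comparison for the parallel case. A minimal sketch of what I mean
(my own code, with std::accumulate standing in for the call being
measured; compile with -fopenmp):

#include <iostream>
#include <numeric>
#include <vector>
#include <omp.h>   // omp_get_wtime

int main()
{
    std::vector<double> data(10000000, 1.0);

    // omp_get_wtime() reports elapsed wall-clock time, not the
    // summed-over-threads CPU time that std::clock()-based timers report.
    const double start = omp_get_wtime();
    const double sum = std::accumulate(data.begin(), data.end(), 0.0);
    const double elapsed = omp_get_wtime() - start;

    std::cout << "accumulate gave " << sum << " in " << elapsed << " seconds\n";
    return 0;
}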

-- 
Bjørn
