Boost logo

Boost :

Subject: Re: [boost] BOOST_FOREACH slow?
From: Peter Bartlett (pete_at_[hidden])
Date: 2008-11-18 13:03:35


Quoting Eric Niebler <eric_at_[hidden]>:

>
> Peter Bartlett wrote:
>> Quoting Hansi <hansipet_at_[hidden]>:
>>
>>> regarding to the benchmark link posted from Michael Marcin
>>> BOOST_FOREACH is 3 times slower then other equivalents...it is a lot...
>>>
>>> Best regards
>>> Hansjörg
>>
>> And that should make you sceptical of the results! Once you strip
>> away the TMP and macros, FOREACH is not much different from a for
>> lopp. There is an additional boolean [or so?] that looks ripe for
>> optimization.
>
> Yes, one extra boolean test/set per iteration. That's all you should
> pay for using BOOST_FOREACH when optimizations are turned on. I can't
> get rid of it without also getting rid of the ability to iterate over a
> sequence by reference.

Right. Ahead of seeing your mail, I worked through the FOREACH
implementation and once I'd removed the macros, the ?: tricks, the
if(false) tricks and the static_casts came down to the basic
implementation - I agree that there is an extra boolean that is
checked and set per iteration. I figured this is where your 5% rule of
thumb came from.

But then I realised the inner for loop (which does the setting and
checking of the boolean) is run exactly once for each run of the outer
loop. Thus it seemed likely it might be a candidate for loop
unrolling, and hence FOREACH actually be exactly efficient as a
"normal" for loop.

I thus compiled the following code:

#define _SECURE_SCL 0
#include <iostream>
#include <vector>
#include <boost/foreach.hpp>

struct obj
{
   obj(int x) : value(x) {}

   char xxx[100000];
   int value;
};

__declspec(noinline) int foreach_version(std::vector<obj> const& x)
{
   int result = 0;
   BOOST_FOREACH(obj const& i , x)
     result += i.value;

   return result;
}

__declspec(noinline) int iterator_version(std::vector<obj> const& x)
{
   int result = 0;
   for( std::vector<obj>::const_iterator c = x.begin() ; c != x.end() ; ++c )
     result += c->value;

   return result;
}

int main()
{
   std::vector<obj> foo(100,1);

   int jj = iterator_version(foo);
   int j = foreach_version(foo);

   std::cout << "\n" << j << " " << jj << "\n\n";
}

The generated assembler for the two functions was:

__declspec(noinline) int foreach_version(std::vector<obj> const& x)
{
   int result = 0;
   BOOST_FOREACH(obj const& i , x)
00401560 mov edx,dword ptr [esp+4]
00401564 mov ecx,dword ptr [edx+4]
00401567 mov edx,dword ptr [edx+8]
0040156A xor eax,eax
0040156C lea esp,[esp]
00401570 cmp ecx,edx
00401572 je foreach_version+22h (401582h)
     result += i.value;
00401574 add eax,dword ptr [ecx+186A0h]
0040157A add ecx,186A4h
00401580 jmp foreach_version+10h (401570h)

   return result;
}

////////////////////////////////

__declspec(noinline) int iterator_version(std::vector<obj> const& x)
{
   int result = 0;
   for( std::vector<obj>::const_iterator c = x.begin() ; c != x.end() ; ++c )
00401530 mov edx,dword ptr [esp+4]
00401534 mov ecx,dword ptr [edx+4]
00401537 mov edx,dword ptr [edx+8]
0040153A xor eax,eax
0040153C cmp ecx,edx
0040153E je iterator_version+20h (401550h)
     result += c->value;
00401540 add eax,dword ptr [ecx+186A0h]
00401546 add ecx,186A4h
0040154C cmp ecx,edx
0040154E jne iterator_version+10h (401540h)

   return result;
}

I.e. the FOREACH version has an extra lea instruction, but has one
fewer cmp instruction. Unfortunately I'm too much of a newbie to weigh
up the relative cost of those two instructions, but the inner for loop
does indeed appear to have been optimized at least somewhat.

Maik wrote:
> maik_at_harald:~$ g++ -DNDEBUG -O3 test.cpp -I /home/maik/workspace/boost
> maik_at_harald:~$ ./a.out
> Iterator accumulate took 0.09 seconds.
> Pointer accumulate took 0.09 seconds.
> Iterator for loop took 0.08 seconds.
> Pointer for loop took 0.09 seconds.
> Index for loop took 0.09 seconds.
> Index for_each took 0.09 seconds.
> Pointer for_each took 0.09 seconds.
> BOOST_FOREACH took 0.09 seconds.

The uniformity of these results may just be a fluke but it would be
intriguing to find out if all variations have the same generated code
- I suspect gcc also has no problems optimizing FOREACH to the point
where one can't worry about performance when using it.

Pete


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk