From: Martin Schulz (Martin.Schulz_at_[hidden])
Date: 2007-03-29 16:32:29


Anybody out there?

Please tell me that I am not the only person on this mailing list to
spot that the flop rates posted earlier are much too low compared
to what a 2.8 GHz P4 should be able to deliver.

So, for a closer look, I get out the compiler, VC8 in my case.
Example14, ok. Run. First line appears, then the program hangs for
minutes. Oh no, yeah, that was the debug build.... ^C.

Ok, once again in release mode. Works better, but the figures are still
disappointing:

f(x,y,z) took 4.218 seconds to run 1e+009 iterations with double = 9.48317e+008 flops
f(x,y,z) took 5.047 seconds to run 1e+009 iterations with quantity<double> = 7.9255e+008 flops
g(x,y,z) took 4.219 seconds to run 1e+009 iterations with double = 9.48092e+008 flops
g(x,y,z) took 4.625 seconds to run 1e+009 iterations with quantity<double> = 8.64865e+008 flops

950 MFlops, already better than the numbers posted beforehand. A brief
look at the code: no memory referenced, just local variables. That
should perform better. The zero-overhead is not so zero, after all. Run
once again: to no avail.
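
The figures above are consistent with the benchmark counting four floating-point operations per iteration (two additions and two multiplications in `V = V + ((x + y) * z * C)`). That count is inferred from the printed numbers, not taken from the benchmark source. A minimal sketch of the arithmetic, with `flop_rate` as a hypothetical helper name:

```cpp
#include <cassert>

// Hypothetical helper: floating-point operations per second for a timed loop.
// The default of 4 operations per iteration is an assumption inferred from
// the printed figures, not read from the benchmark itself.
inline double flop_rate(double seconds, double iterations,
                        double flops_per_iteration = 4.0)
{
    assert(seconds > 0.0);
    return iterations * flops_per_iteration / seconds;
}
```

With the first measurement, `flop_rate(4.218, 1e9)` comes out at about 9.48e8, which matches the figure printed for plain `double`.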

Another look at the code. Oh, what is this "if" in that loop? So measure
that "if" alone:

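#include <cstdlib>   // std::rand, RAND_MAX

// The loop from the example with the floating-point payload commented out,
// so only the loop and the "if" are measured.
// TEST_LIMIT is defined by the example (1e9 iterations in these runs).
inline
double f2(double x,double y,double z)
{
    double V = 0,
           C = 0;

    for (int i = 0; i < TEST_LIMIT; ++i)
    {
        if (i % 100000 == 0)
            C = double(std::rand())/RAND_MAX;

        //V = V + ((x + y) * z * C);
    }

    return V;
}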

That gives:
f2(x,y,z) took 3.187 seconds to run 1e+009 iterations with double = 1.2551e+009 (would be) flops
f2(x,y,z) took 3.141 seconds to run 1e+009 iterations with quantity<double> = 1.27348e+009 (would be) flops

Oops, the loop alone, even without any floating point, takes more than
3 seconds? More overhead than payload?
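
A likely contributor (an assumption, not something measured here) is the `i % 100000` test itself, which has to be evaluated on every single iteration. A minimal sketch of the same refresh schedule driven by a countdown instead of a modulo follows. `count_refreshes_modulo` and `count_refreshes_countdown` are hypothetical names, and both merely count how often `C` would be refreshed:

```cpp
// Counts how often the loop would refresh C when a modulo test
// is evaluated on every iteration.
inline int count_refreshes_modulo(int limit, int period)
{
    int refreshes = 0;
    for (int i = 0; i < limit; ++i)
        if (i % period == 0)
            ++refreshes;
    return refreshes;
}

// Same schedule, but a countdown replaces the modulo.
inline int count_refreshes_countdown(int limit, int period)
{
    int refreshes = 0;
    int countdown = 0;            // zero forces a refresh at i == 0
    for (int i = 0; i < limit; ++i)
    {
        if (countdown == 0)
        {
            ++refreshes;
            countdown = period;
        }
        --countdown;
    }
    return refreshes;
}
```

Both variants refresh at exactly the same iterations, so timing one against the other would show how much of the 3 seconds is due to the modulo.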

Another one:

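#include <algorithm> // std::min
#include <cstdlib>   // std::rand, RAND_MAX

// Same payload, but C is refreshed once per block of 100000 iterations,
// so the inner loop contains no "if". Assumes TEST_LIMIT is an int.
inline
double f3(double x,double y,double z)
{
    double V = 0,
           C = 0;

    for (int i = 0; i < TEST_LIMIT; )
    {
        C = double(std::rand())/RAND_MAX;
        const int next_limit = std::min(TEST_LIMIT, i+100000);

        for (; i < next_limit; ++i)
        {
            V = V + ((x + y) * z * C);
        }
    }

    return V;
}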

That gives:
f3(x,y,z) took 1.515 seconds to run 1e+009 iterations with double = 2.64026e+009 flops
f3(x,y,z) took 4.656 seconds to run 1e+009 iterations with quantity<double> = 8.59107e+008 flops

2.6 GFlops? That is ok for a single thread. But the zero-overhead
appears to be a factor of 3 now!

What do you say? GCC would be better? So switch to the Linux box. g++ -O3.
Looks better right from the start. Even though the P4 is supposed to be
slower than the Core 2.

f(x,y,z) took 3.22 seconds to run 1e+09 iterations with double = 1.24224e+09 flops
f(x,y,z) took 3.21 seconds to run 1e+09 iterations with quantity<double> = 1.24611e+09 flops
f2(x,y,z) took 3.22 seconds to run 1e+09 iterations with double = 1.24224e+09 (would be) flops
f2(x,y,z) took 3.22 seconds to run 1e+09 iterations with quantity<double> = 1.24224e+09 (would be) flops
f3(x,y,z) took 0.51 seconds to run 1e+09 iterations with double = 7.84314e+09 flops
f3(x,y,z) took 0.65 seconds to run 1e+09 iterations with quantity<double> = 6.15385e+09 flops

Oh, but what is that 7.84 GFlops over there? That one goes beyond the
peak performance of the processor! GCC must be cheating here! Hmm. What
does the Intel compiler give?

f(x,y,z) took 4.2 seconds to run 1e+09 iterations with double = 9.52381e+08 flops
f(x,y,z) took 5.29 seconds to run 1e+09 iterations with quantity<double> = 7.56144e+08 flops
f2(x,y,z) took 4.19 seconds to run 1e+09 iterations with double = 9.54654e+08 (would be) flops
f2(x,y,z) took 4.18 seconds to run 1e+09 iterations with quantity<double> = 9.56938e+08 (would be) flops
f3(x,y,z) took 0.47 seconds to run 1e+09 iterations with double = 8.51064e+09 flops
f3(x,y,z) took 6.95 seconds to run 1e+09 iterations with quantity<double> = 5.7554e+08 flops

Hmm. Even more cheating on plain doubles, but it does not seem to like
the templates. For this one, the overhead increases to nearly a factor
of 15!
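
One plausible explanation for the "cheating", not verified against the generated assembly: in f3 the product `(x + y) * z * C` does not change inside the inner loop, so an optimizer may compute it once per block and leave a single addition per iteration. The printed rates are consistent with counting four operations per iteration (4e9 / 0.51 s is about 7.84e9), so 7.84 GFlops would correspond to roughly 1.96e9 additions actually executed per second. A hand-hoisted sketch of what the compiler may effectively be running; `f3_hoisted` and its `limit` and `block` parameters are hypothetical names, not part of the example:

```cpp
#include <algorithm>
#include <cstdlib>

// Hand-hoisted variant of f3: the loop-invariant product is computed once
// per block, so the inner loop performs one addition per iteration.
inline double f3_hoisted(double x, double y, double z, int limit, int block)
{
    double V = 0;

    for (int i = 0; i < limit; )
    {
        const double C = double(std::rand()) / RAND_MAX;
        const double term = (x + y) * z * C;   // invariant within the block
        const int next_limit = std::min(limit, i + block);

        for (; i < next_limit; ++i)
            V = V + term;
    }
    return V;
}
```

If the wrapper type keeps the compiler from proving that the product is invariant, the same hoisting may not happen for `quantity<double>`, which would fit the gap seen with the Intel compiler.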

So let's play a bit further... What are those funny "inline"s for? Let's
try to #define them away... g++ -O3 again.

f(x,y,z) took 3.25 seconds to run 1e+09 iterations with double = 1.23077e+09 flops
f(x,y,z) took 9.96 seconds to run 1e+09 iterations with quantity<double> = 4.01606e+08 flops
f2(x,y,z) took 3.23 seconds to run 1e+09 iterations with double = 1.23839e+09 (would be) flops
f2(x,y,z) took 3.19 seconds to run 1e+09 iterations with quantity<double> = 1.25392e+09 (would be) flops
f3(x,y,z) took 0.52 seconds to run 1e+09 iterations with double = 7.69231e+09 flops
f3(x,y,z) took 10.2 seconds to run 1e+09 iterations with quantity<double> = 3.92157e+08 flops

Ouch, f3 on quantity<double> had been 0.65 seconds beforehand, now it is
10 seconds. Somehow gcc forgot to cheat (er, "optimize") here. And even
the original example f gets about 3 times slower now. There are only
about 1000 (even non-virtual) function calls involved. These cannot
possibly add up to 6 or even 10 seconds. Something different is going on
here....
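
One possibility these figures would be consistent with: if the `#define` also reaches small operators that the library declares `inline`, the compiler may stop inlining them, and then every iteration pays for several calls and temporaries. The toy wrapper below illustrates the mechanism. It is a minimal sketch, not Boost.Units code: `toy_quantity`, `payload_wrapped` and `payload_plain` are hypothetical names, and the only assumption is that each arithmetic operator is a small function forwarding to the underlying `double`.

```cpp
// Hypothetical stand-in for a unit-carrying wrapper; not the Boost.Units API.
struct toy_quantity
{
    double value;
    explicit toy_quantity(double v) : value(v) {}
};

// Each operator is a tiny forwarding function. The abstraction costs nothing
// only if the compiler inlines these calls into the loop.
inline toy_quantity operator+(toy_quantity a, toy_quantity b)
{
    return toy_quantity(a.value + b.value);
}

inline toy_quantity operator*(toy_quantity a, toy_quantity b)
{
    return toy_quantity(a.value * b.value);
}

// Same expression as the benchmark payload, written against the wrapper.
inline double payload_wrapped(double x, double y, double z, double C, int n)
{
    toy_quantity V(0.0);
    const toy_quantity qx(x), qy(y), qz(z), qc(C);
    for (int i = 0; i < n; ++i)
        V = V + ((qx + qy) * qz * qc);
    return V.value;
}

// Plain-double reference with the identical arithmetic.
inline double payload_plain(double x, double y, double z, double C, int n)
{
    double V = 0.0;
    for (int i = 0; i < n; ++i)
        V = V + ((x + y) * z * C);
    return V;
}
```

Both functions compute the same value. Whether they also run at the same speed depends entirely on the four operator calls per iteration being inlined, which is the part a compiler is free to decline.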

Well, it is getting late, I will stop here. So, the long and the short
of it: I don't believe the zero-overhead claim. Not with the compilers I
currently have at hand. Furthermore, the example is simply not
meaningful; it allows the compilers to play so many tricks that the
resulting numbers are little more than noise.

Matthias,

I therefore have 3 further points for the "application domain and
restrictions" page:

- Using the library may degrade the performance of the debug build of
your software by several orders of magnitude, depending on the actual
code.

- The library is very demanding on the compiler. Apart from the
compatibility requirements, the run-time penalty induced by the use
of the library mostly ranges from none to a factor of three; even higher
factors have been observed in very special cases. [btw, has anybody done
a comparison of compile times on reasonably sized projects?]

- The use of this library may impose additional obstacles when doing
in-depth performance tuning for numerical computations, as the compilers
may no longer recognize certain optimization possibilities.

I would have liked to give you more positive feedback,
        Martin.

