Boost logo

Boost Users :

From: jhr.walter_at_[hidden]
Date: 2003-03-01 04:47:08


Hi Stephen,

you wrote:

> > > I wrote a small piece of C code to calculate a symmetric-matrix by
> > > row-matrix product by directly accessing the memory (which btw is
about
> > > 1000% faster than ublas's generated code with gcc 3.2.3 and -O3, why
is
> > > ublas so slow?)
> >
> > If you have *not* defined NDEBUG, ublas uses many internal consistency
> > checks and usually no expression templates.
>
> I was referring to the case when NDEBUG is defined and comparing
> symmetric_matrix by general_matrix products. uBLAS is quite slow with this
> operation. However, I compared my symmetric_matrix by
> general_matrix_transpose product and ublas was only 3-4 times slower than
> the optimized C, not too bad.

OK, I've noticed that you use something like

C = prod (A, B);

As far as I understand, your matrices are alias free, so you could try

C.assign (prod (A, B));

instead.

> > > And when I compile with -DNDEBUG the results differ
> > > by around +/- 1e-16, but when its not defined the results match almost
> > > perfectly.
> >
> > You've probably hit a bug, may be an old one (please consider to check
> > against CVS, too). Could you please send or post a small sample, if the
> > problem persists?
>
> I think I know the source of the problem now, and I'm not sure if it's one
> that boost should handle or not. I attached my test code at the end of
> this email, only 2.7k, I hope it isn't inappropriate for the list.
>
> 387 floating point registers are stored as 80 bits internally and rounded
to
> 64 bits when assigned to memory. SSE2 floating-point registers are stored
in
> the native 64-bit format. The following table lists the compiler flags
used
> and the accumulated absolute difference between the two products over 100
> randomly generated matrices.
>
> gcc version 3.2.3 20030221 (Debian prerelease)
>
> no compiler flags
> diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
>
> -DNDEBUG
> diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
>
> -O1
> diff sum
>
[5,3]((2.91694e-15,2.65348e-15,2.7027e-15),(4.4049e-15,3.29251e-15,3.44863e-
15),(4.10436e-15,4.14816e-15,3.24393e-15),(3.57873e-15,3.75741e-15,2.64068e-
15),(5.03091e-15,2.79637e-15,4.06099e-15))
>
> -O1 -DNDEBUG
> diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
>
> -O2
> diff sum
>
[5,3]((4.37584e-15,4.8659e-15,3.61516e-15),(4.61783e-15,3.77736e-15,4.04169e
-15),(4.46691e-15,5.10876e-15,5.00576e-15),(3.64118e-15,4.91968e-15,5.17641e
-15),(3.97273e-15,4.25528e-15,3.75394e-15))
>
> -O2 -DNDEBUG
> diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
>
> -O3
> diff sum
>
[5,3]((4.96131e-15,3.88968e-15,4.34245e-15),(5.07754e-15,5.91974e-15,4.19109
e-15),(4.89539e-15,4.29279e-15,5.06344e-15),(4.79781e-15,4.66033e-15,3.77996
e-15),(4.75227e-15,4.72539e-15,3.33414e-15))
>
> -O3 -DNDEBUG
> diff sum
>
[5,3]((4.60569e-15,2.67928e-15,2.7079e-15),(3.21054e-15,2.98112e-15,2.61249e
-15),(3.91701e-15,3.60822e-15,2.93515e-15),(3.23006e-15,3.42434e-15,3.53363e
-15),(3.18105e-15,2.77035e-15,3.67198e-15))
>
> -O1 -mfpmath=sse,387 -march=pentium4 -msse2
> diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
>
> -O1 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG
> diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
>
> -O2 -mfpmath=sse,387 -march=pentium4 -msse2
> diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
>
> -O2 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG
> diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
>
> -O3 -mfpmath=sse,387 -march=pentium4 -msse2
> diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
>
> -O3 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG
> diff sum
[5,3]((3.62991e-15,3.74006e-15,4.62824e-15),(4.00548e-15,4.5311e-15,3.60129e
-15),(3.48332e-15,5.11743e-15,3.08174e-15),(3.49373e-15,4.05752e-15,4.2865e-
15),(4.81169e-15,4.51549e-15,4.18589e-15))
>
> Is this perhaps a compiler bug?

At least an issue ;-) Building your program with Intel 7.0 (boost default
settings) results in

diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))

for both debug and release builds. Setting the magical compiler
switch -ffloat-store seems to heal the problem for GCC, too.

> Part of the problem actually comes from
> whether the products are accumulated into a register by the compiler, or
> whether they are accumulated into memory. If accumulated into a 387
> register it has higher-precision before it is rounded back down to 64-bits
to be
> stored into memory. If the compiler generates code to accumulate the
> products directly into memory then the results of each individual product
> are rounded to 64-bits at every iteration and the result will be slightly
> less accurate. Now, when the -msse2 flag is used, the registers are 64-bit
> only and thus no accuracy is gained by using them, only speed.

Thanks,

Joerg


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net