Boost logo

Boost Users :

From: Stephen Crowley (stephenc_at_[hidden])
Date: 2003-02-28 12:29:47


On Fri, Feb 28, 2003 at 09:53:56AM +0100, Joerg Walter wrote:
> > I wrote a small piece of C code to calculate a symmetric-matrix by
> > row-matrix product by directly accessing the memory (which btw is about
> > 1000% faster than ublas's generated code with gcc 3.2.3 and -O3, why is
> > ublas so slow?)
>
> If you have *not* defined NDEBUG, ublas uses many internal consistency
> checks and usually no expression templates.

I was referring to the case when NDEBUG is defined and comparing
symmetric_matrix by general_matrix products. uBLAS is quite slow with this
operation. However, I compared my symmetric_matrix by
general_matrix_transpose product and ublas was only 3-4 times slower than
the optimized C, not too bad.

> > And when I compile with -DNDEBUG the results differ
> > by around +/- 1e-16, but when its not defined the results match almost
> > perfectly.
>
> You've probably hit a bug, may be an old one (please consider to check
> against CVS, too). Could you please send or post a small sample, if the
> problem persists?

I think I know the source of the problem now, and I'm not sure if it's one
that boost should handle or not. I attached my test code at the end of this
email, only 2.7k, I hope it isn't inappropriate for the list.

387 floating point registers are stored as 80 bits internally and rounded to
64 bits when assigned to memory. SSE2 floating-point registers are stored in
the native 64-bit format. The following table lists the compiler flags used
and the accumulated absolute difference between the two products over 100
randomly generated matrices.

gcc version 3.2.3 20030221 (Debian prerelease)

no compiler flags
diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))

-DNDEBUG
diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))

-O1
diff sum
[5,3]((2.91694e-15,2.65348e-15,2.7027e-15),(4.4049e-15,3.29251e-15,3.44863e-15),(4.10436e-15,4.14816e-15,3.24393e-15),(3.57873e-15,3.75741e-15,2.64068e-15),(5.03091e-15,2.79637e-15,4.06099e-15))

-O1 -DNDEBUG
diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))

-O2
diff sum
[5,3]((4.37584e-15,4.8659e-15,3.61516e-15),(4.61783e-15,3.77736e-15,4.04169e-15),(4.46691e-15,5.10876e-15,5.00576e-15),(3.64118e-15,4.91968e-15,5.17641e-15),(3.97273e-15,4.25528e-15,3.75394e-15))

-O2 -DNDEBUG
diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))

-O3
diff sum
[5,3]((4.96131e-15,3.88968e-15,4.34245e-15),(5.07754e-15,5.91974e-15,4.19109e-15),(4.89539e-15,4.29279e-15,5.06344e-15),(4.79781e-15,4.66033e-15,3.77996e-15),(4.75227e-15,4.72539e-15,3.33414e-15))

-O3 -DNDEBUG
diff sum
[5,3]((4.60569e-15,2.67928e-15,2.7079e-15),(3.21054e-15,2.98112e-15,2.61249e-15),(3.91701e-15,3.60822e-15,2.93515e-15),(3.23006e-15,3.42434e-15,3.53363e-15),(3.18105e-15,2.77035e-15,3.67198e-15))

-O1 -mfpmath=sse,387 -march=pentium4 -msse2
diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))

-O1 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG
diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))

-O2 -mfpmath=sse,387 -march=pentium4 -msse2
diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))

-O2 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG
diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))

-O3 -mfpmath=sse,387 -march=pentium4 -msse2
diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))

-O3 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG
diff sum [5,3]((3.62991e-15,3.74006e-15,4.62824e-15),(4.00548e-15,4.5311e-15,3.60129e-15),(3.48332e-15,5.11743e-15,3.08174e-15),(3.49373e-15,4.05752e-15,4.2865e-15),(4.81169e-15,4.51549e-15,4.18589e-15))

Is this perhaps a compiler bug? Part of the problem actually comes from
whether the products are accumulated into a register by the compiler, or
whether they are accumulated into memory. If accumulated into a 387 register
it has higher-precision before it is rounded back down to 64-bits to be
stored into memory. If the compiler generates code to accumulate the
products directly into memory then the results of each individual product
are rounded to 64-bits at every iteration and the result will be slightly
less accurate. Now, when the -msse2 flag is used, the registers are 64-bit
only and thus no accuracy is gained by using them, only speed.

-- 
Stephen
[Non-text portions of this message have been removed]

Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net