|
Ublas : |
From: Gunter Winkler (guwi17_at_[hidden])
Date: 2005-06-11 17:20:23
Am Donnerstag, 9. Juni 2005 12:18 schrieb Toon Knapen:
> I am a bit surprised. I would have expected that [RC]RC would provide
> optimal performance because (generally) the prod is calculated by
> axpy-ing rows of A with columns of B. Axpy-prod however is slowest on
axpy_prod only uses the storage orientation of the result to choose the
algorithm.
The (dense) row major lhs case uses this assignment:
for (size_type i = 0; i < size1; ++ i)
for (size_type j = 0; j < size2; ++ j)
row (m, i).plus_assign (e1 () (i, j) * row (e2 (), j));
This is clearly bad if the 2nd factor is column major.
> these configs. BTW: what is 'opb'?
X = opb_prod( A, B);
read "outer products": X is computed as a sum of outer products of
columns of A and rows of B.
I improved the dense axpy_prod algorithm selection by adding two new
special products (see attached patch) and got the following results:
size of metrices - 500x500*500x500
prod axpy opb block goto atlas rkc rck krc kcr crk ckr
RRR 2.8 1.69 1.73 0.49 0.46 0.12 1.7 3.76 2.05 7.08 2.97 7.1
RRC 1.66 1.72 1.74 0.49 0.47 0.11 3.52 1.83 2.85 7.14 1.9 7.09
RCR 6.06 1.71 1.72 0.51 0.45 0.13 1.76 6.62 2 4.01 6.7 3.07
RCC 2.61 2.78 1.74 0.49 0.45 0.13 3.52 2.97 2.8 4.04 3.89 3.04
CRR 2.61 2.62 1.72 0.5 0.46 0.13 3.02 3.8 4.04 2.81 2.92 3.44
CRC 1.67 1.68 1.75 0.47 0.46 0.12 6.95 1.92 6.99 2.84 1.85 3.43
CCR 6.06 1.67 1.7 0.51 0.45 0.12 3.07 6.65 3.99 2.01 6.65 1.78
CCC 2.82 1.63 1.72 0.49 0.47 0.11 7 3.02 6.96 2.05 3.84 1.74
now axpy_prod is not slower than prod and always faster than the
naive implementation. But all these timings are IMO quite useless
because the multiplication of large square matrices is a very special
case. All times strongly depend on the sizes:
$ ./sample83 200000 8 8 10 2>xxx
size of matrices - 200000x8*8x8
prod axpy opb block goto atlas rkc rck krc kcr crk ckr
RRR 0.61 0.98 2.74 1.59 1.49 0.53 1.07 1.51 2.52 16.88 2.76 16.87
RRC 0.6 1.01 2.77 1.58 1.46 0.53 1.04 1.36 2.6 16.85 2.76 16.87
RCR 0.73 0.93 2.23 1.59 1.46 0.46 1.05 1.75 2.13 11.75 3.08 11.75
RCC 0.73 1.27 2.24 1.58 1.47 0.46 1.02 1.44 2.12 11.76 3.05 11.75
CRR 1.72 1.79 3.31 1.5 1.39 0.54 1.21 1.53 2.66 8.05 2.09 8.05
CRC 1.74 1.8 3.31 1.49 1.38 0.53 1.24 1.32 2.68 8.04 1.95 8.05
CCR 1.11 2.36 2.78 1.51 1.37 0.45 1.28 1.8 1.9 2.59 1.96 2.52
CCC 0.98 2.36 2.81 1.5 1.38 0.44 1.28 1.53 1.92 2.59 1.79 2.53
(divide all times by 10) Here prod() is the best choice and the
difference to atlas is much smaller. These product frequently appears
while applying a coordinate transformation to a list of points or
while computing test function values and gradients for a FEM code.
mfg
Gunter