
Ublas : 
Subject: [ublas] General (specialize)d optimized matrix product proposal
From: Jörn Ungermann (consulting_at_[hidden])
Date: 20100321 18:28:04
Dear all!
Abstract;
The (lacking) performance of sparsematrix products was quite often
noted on this list.
Currently there are several functions which implement matrix products.
Each is efficient under certain conditions only and some of them are not
even documented.
I think it important for users to have one (near)optimal function only.
The attached file contains an improved axpy_prod that matches the
performance of "prod", "axpy_prod" and "sparse_prod" and is thereby
optimal for all matrix types (see table below).
Details:
The optimal choice of kernel for a matrix product depends on the
properties of all involved matrices. The current implementations
specialize only on the type of target matrix. By also determining the
kernel depending on the source matrices, one can choose the most
efficient kernel.
My aim was to create a single function to match the performance of prod,
axpy_prod and sparse_prod for all combinations of
compressed_matrix/matrix and column/rowmajority.
The other matrix types should also be handled efficiently by my product,
too, but I did check it only spuriously, as it should be obvious that
those types are more suited for matrix setup, not for actual
calculation.
My axpy_prod implementation (called axpy_prod2 in the attached file)
does not offer support for the undocumented triangular_expression stuff
contained in the original axpy_prod, which however seems to be buggy in
the current code for dense matrices.
The kernels are largely based on existing kernels with one or two new
ones being thrown in. They are as abstract as possible to handle
arbitrary expressions efficiently. Specializing, e.g. directly on
compressed_matrix would give another very significant performance boost,
but also many more additional kernels, possibly more than could be
maintained, especially as the transposedness might have to be handled
explicitly, too.
It would be very nice, if someone could rewrite prod/prec_prod to handle
matrix products in the same way as my axpy_prod2 does, but I did not
look deep enough into the expressiontemplates to do this myself or to
even know if this were possible.
In fact, I'd propose to have two sets of interfaces:
1) One convenient one, possibly compromising efficiency
2) One modeled closely after CBLAS, delivering utmost efficiency.
The latter one could then be *very easily* overloaded by the numeric
bindings for dense matrices. I added a possible generic implementation
for a gemm call that could be trivially overloaded for dense matrices
and forwarded to, e.g., ATLAS numeric bindings.
If one could achieve the same efficiency and automatic (i.e. by
including a header) coupling to numeric bindings using only *one*
interface, I'd prefer that naturally. However currently, we have not
just two, but too many product functions (prod, prec_prod, axpy_prod,
sparse_prod, opb_prod, block_prod).
The following table gives the result for all 64 combinations of
compressed_matrix/matrix and row/columnmajorities for the three
involved in this case 2000x2000 matrices.
com_rm is a compressed_matrix of row_major type.
den_cm is a matrix of column_major type.
The 4th column indicates the used kernel.
The 5th column gives the runtime for axpy_prod2 ( clock()/1000 )
The 6th column gives the runtime for sparse_prod ( clock()/1000 )
The 7th column gives the runtime for axpy_prod ( clock()/1000 )
The 8th column gives the runtime for prod ( clock()/1000 )
The 10th column gives the speedup of axpy_prod2 compared to sparse_prod.
The 11th column gives the speedup of axpy_prod2 compared to axpy_prod.
The 12th column gives the speedup of axpy_prod2 compared to prod.
Larger matrix sizes result in prohibitive runtimes for the "slow"
products, but can be used to analyse puresparse products.
The runtime shall be taken only qualitatively.
One can see that the only cases where the new implementation is slower
are of relatively small runtime, so it may be negligible. sparse_prod
uses an optimization that is very efficient if the target matrix has few
offdiagonal elements, but is very inefficient if it does.
The results will therefore vary depending on the test matrices.
It is also obvious to see, why some people complain about product
performance, as especially axpy_prod and prod are sometimes ridiculously
slow for sparse matrices and sparse_prod does not seem to be documented
in the special products section.
My favorite case is "com_cm, com_cm, den_rm" one, which actually
occurred in the diagnostics part of our application and was the reason
why we started looking into this topic.
One might ask, if all these combinations are valid use cases. I'd say
most of them are. Using trans() on an input parameter changes the
majority and should therefore happen quite often. Writing the product of
sparse matrices to a dense matrix seems also sensible as this product
will often not be sparse anymore.
com_rm
com_rm, com_rm, com_rm .ururur 10. 10. 10. 88340. spdup: 1.00 1.00 8834.00
com_rm, com_rm, den_rm .ururur 40. 30. 3270. 87490. spdup: 0.75 81.75 2187.25
com_rm, com_rm, com_cm .ucucuc 320. 340. 360. 85750. spdup: 1.06 1.12 267.97
com_rm, com_rm, den_cm .ururur 20. 380. 193010. 87020. spdup: 19.00 9650.50 4351.00
com_rm, den_rm, com_rm .ururur 210. 840. 4210. 2100. spdup: 4.00 20.05 10.00
com_rm, den_rm, den_rm .ururur 220. 890. 16540. 2080. spdup: 4.05 75.18 9.45
com_rm, den_rm, com_cm .urdcuc 800. 90010. 100030. 2050. spdup: 112.51 125.04 2.56
com_rm, den_rm, den_cm .urdcuc 630. 93330. 187840. 2190. spdup: 148.14 298.16 3.48
com_rm, com_cm, com_rm .ururur 310. 310. 330. 670. spdup: 1.00 1.06 2.16
com_rm, com_cm, den_rm .ururur 310. 320. 184540. 710. spdup: 1.03 595.29 2.29
com_rm, com_cm, com_cm .ucucuc 310. 310. 280. 660. spdup: 1.00 0.90 2.13
com_rm, com_cm, den_cm .ucucuc 330. 310. 182840. 720. spdup: 0.94 554.06 2.18
com_rm, den_cm, com_rm .ururur 300. 910. 4310. 2090. spdup: 3.03 14.37 6.97
com_rm, den_cm, den_rm .ururur 340. 970. 72690. 2120. spdup: 2.85 213.79 6.24
com_rm, den_cm, com_cm .urdcuc 690. 91900. 101780. 2030. spdup: 133.19 147.51 2.94
com_rm, den_cm, den_cm .urdcuc 540. 92090. 181780. 2000. spdup: 170.54 336.63 3.70
com_cm
com_cm, com_rm, com_rm .ururur 40. 60. 60. 173020. spdup: 1.50 1.50 4325.50
com_cm, com_rm, den_rm .ucurdr 40. 70. 3200. 165360. spdup: 1.75 80.00 4134.00
com_cm, com_rm, com_cm .ucucuc 50. 60. 60. 166080. spdup: 1.20 1.20 3321.60
com_cm, com_rm, den_cm .ucucuc 70. 80. 3250. 168970. spdup: 1.14 46.43 2413.86
com_cm, den_rm, com_rm .ururur 160. 860. 4480. 94040. spdup: 5.38 28.00 587.75
com_cm, den_rm, den_rm .ucurdr 210. 840. 16320. 91700. spdup: 4.00 77.71 436.67
com_cm, den_rm, com_cm .ucucuc 680. 670. 6480. 92700. spdup: 0.99 9.53 136.32
com_cm, den_rm, den_cm .ucucuc 700. 770. 3410. 94180. spdup: 1.10 4.87 134.54
com_cm, com_cm, com_rm .ururur 350. 330. 350. 83880. spdup: 0.94 1.00 239.66
com_cm, com_cm, den_rm .ucucuc 30. 360. 186920. 82810. spdup: 12.00 6230.67 2760.33
com_cm, com_cm, com_cm .ucucuc 10. 10. 10. 87320. spdup: 1.00 1.00 8732.00
com_cm, com_cm, den_cm .ucucuc 40. 20. 3350. 86040. spdup: 0.50 83.75 2151.00
com_cm, den_cm, com_rm .ururur 310. 870. 4830. 94410. spdup: 2.81 15.58 304.55
com_cm, den_cm, den_rm .ucucuc 600. 900. 76000. 93800. spdup: 1.50 126.67 156.33
com_cm, den_cm, com_cm .ucucuc 630. 600. 6410. 93480. spdup: 0.95 10.17 148.38
com_cm, den_cm, den_cm .ucucuc 620. 640. 3330. 92700. spdup: 1.03 5.37 149.52
den_rm
den_rm, com_rm, com_rm .ururur 550. 640. 6630. 96910. spdup: 1.16 12.05 176.20
den_rm, com_rm, den_rm .ururur 490. 630. 3110. 96140. spdup: 1.29 6.35 196.20
den_rm, com_rm, com_cm .ucucuc 410. 880. 4880. 98340. spdup: 2.15 11.90 239.85
den_rm, com_rm, den_cm .ururur 500. 930. 76210. 96800. spdup: 1.86 152.42 193.60
den_rm, den_rm, com_rm .ddur 72560. 204380.1736640. 70530. spdup: 2.82 23.93 0.97
den_rm, den_rm, den_rm .drdrdr 15910. 204960. 15880. 70870. spdup: 12.88 1.00 4.45
den_rm, den_rm, com_cm .dduc 62130. 225070.1728390. 62830. spdup: 3.62 27.82 1.01
den_rm, den_rm, den_cm .dduc 61330. 224670. 73000. 61530. spdup: 3.66 1.19 1.00
den_rm, com_cm, com_rm .drucur 740. 90370. 99300. 2040. spdup: 122.12 134.19 2.76
den_rm, com_cm, den_rm .ucucuc 380. 95150. 184160. 2100. spdup: 250.39 484.63 5.53
den_rm, com_cm, com_cm .ucucuc 360. 950. 4470. 2070. spdup: 2.64 12.42 5.75
den_rm, com_cm, den_cm .ucucuc 380. 980. 74490. 2150. spdup: 2.58 196.03 5.66
den_rm, den_cm, com_rm .ddur 15950. 232660.1762960. 16230. spdup: 14.59 110.53 1.02
den_rm, den_cm, den_rm .ddur 16030. 228230. 73100. 15410. spdup: 14.24 4.56 0.96
den_rm, den_cm, com_cm .dduc 15990. 224510.1716590. 16140. spdup: 14.04 107.35 1.01
den_rm, den_cm, den_cm .dduc 15460. 224790. 73850. 15580. spdup: 14.54 4.78 1.01
den_cm
den_cm, com_rm, com_rm .ururur 580. 680. 6350. 95920. spdup: 1.17 10.95 165.38
den_cm, com_rm, den_rm .ucurdr 710. 690. 3280. 97030. spdup: 0.97 4.62 136.66
den_cm, com_rm, com_cm .ucucuc 260. 840. 4810. 96100. spdup: 3.23 18.50 369.62
den_cm, com_rm, den_cm .ucucuc 230. 880. 16280. 95940. spdup: 3.83 70.78 417.13
den_cm, den_rm, com_rm .ddur 125890. 211500.1752280. 123910. spdup: 1.68 13.92 0.98
den_cm, den_rm, den_rm .drdrdr 16320. 207460. 16810. 124920. spdup: 12.71 1.03 7.65
den_cm, den_rm, com_cm .dduc 126900. 206840.1645940. 122030. spdup: 1.63 12.97 0.96
den_cm, den_rm, den_cm .dcdcdc 16390. 209770. 16810. 121270. spdup: 12.80 1.03 7.40
den_cm, com_cm, com_rm .drucur 760. 91830. 96810. 2120. spdup: 120.83 127.38 2.79
den_cm, com_cm, den_rm .drucur 610. 92990. 187750. 2280. spdup: 152.44 307.79 3.74
den_cm, com_cm, com_cm .ucucuc 190. 920. 4310. 2270. spdup: 4.84 22.68 11.95
den_cm, com_cm, den_cm .ucucuc 230. 910. 17800. 2310. spdup: 3.96 77.39 10.04
den_cm, den_cm, com_rm .ddur 66980. 240030.1864980. 65220. spdup: 3.58 27.84 0.97
den_cm, den_cm, den_rm .ddur 63400. 233410. 76700. 63820. spdup: 3.68 1.21 1.01
den_cm, den_cm, com_cm .dduc 75140. 214000.1711410. 74020. spdup: 2.85 22.78 0.99
den_cm, den_cm, den_cm .dcdcdc 16460. 215050. 16820. 74570. spdup: 13.07 1.02 4.53
The file I attached can easily be converted to a testcase. I had some
compilation troubles with the iterators forcing me to instantiate some
of them inside the loops. For efficiency, I'd like to define them
outside, but this gave trouble when calling the products from within my
gemm draft.
Compiling the file requires the util.hpp from the uBLAS test cases,
which I included for convenience's sake.
I compiled it using g++ DNDEBUG O3 to get rather comparable results.
I propose to replace the current axpy_prod matrix*matrix code by the
attached code.
Kind regards,
Joern
 JÃ¶rn Ungermann <consulting_at_[hidden]>