 # Boost :

From: lums_at_[hidden]
Date: 2002-06-27 22:36:40

Hi:

To follow up on this a little bit.

There are basically four optimizations that need to be incorporated
into a matrix-matrix product to get max performance (these are
basically what the vendors do).

The principles behind these optimizations: Help the compiler and the
hardware do their stuff and code what yourself they can't do. In
particular, you want to minimize memory accesses (and maximize reuse)
and use the pipeline(s) in the processor.

1) Order the loops properly. If you write a matrix-matrix product
using integer indices (i, j, k, say), there are six possible
orderings for the loops: ijk, ikj, kij, kji, jki, jik (here we
assume the inner loop is c(i,j) += a(i,k) * b(k,j)). The best
ordering will depend on the orientation (row, column) of the
matrices involved. Having k as the inner-most variable could mean
fewer loads and stores in the inner loop since the access to c(i,j)
can be hoisted out.

2) Unroll and jam the inner loop. I.e., unroll at the inner loop with
respect to the two outer indices. This allows the compiler to
enregister a number of quantities. The load/store count versus the
flop count is very favorable now. An example of an unrolled and
jammed matrix-matrix product follows:

for (i = 0; i < m; i += 4) {
for (j = 0; j < n; j += 2) {
double w00 = C[(i )*ldc + j ];
double w01 = C[(i )*ldc + j+1];
double w10 = C[(i+1)*ldc + j ];
double w11 = C[(i+1)*ldc + j+1];
double w20 = C[(i+2)*ldc + j ];
double w21 = C[(i+2)*ldc + j+1];
double w30 = C[(i+3)*ldc + j ];
double w31 = C[(i+3)*ldc + j+1];
for (k = 0; k < p; ++k) {
w00 += A[(i )*lda + k] * B[k*lda + j ];
w01 += A[(i )*lda + k] * B[k*lda + j+1];
w10 += A[(i+1)*lda + k] * B[k*lda + j ];
w11 += A[(i+1)*lda + k] * B[k*lda + j+1];
w20 += A[(i+2)*lda + k] * B[k*lda + j ];
w21 += A[(i+2)*lda + k] * B[k*lda + j+1];
w30 += A[(i+3)*lda + k] * B[k*lda + j ];
w31 += A[(i+3)*lda + k] * B[k*lda + j+1];
}
C[(i )*ldc + j ] = w00;
C[(i )*ldc + j+1] = w01;
C[(i+1)*ldc + j ] = w10;
C[(i+1)*ldc + j+1] = w11;
C[(i+2)*ldc + j ] = w20;
C[(i+2)*ldc + j+1] = w21;
C[(i+3)*ldc + j ] = w30;
C[(i+3)*ldc + j+1] = w31;
}
}

3) Block for cache. Use three more sets of indices to move block by
block through the matrix and use the core algorithm above to do
each block. Again, the ordering of the block indices will also
affect performance.

4) Do a block copy when doing a cache-blocked algorithm. This will
prevent cache conflicts for larger matrix sizes.

These optimizations have effects for different size ranges of
matrices. Optimizations 1 and 2 above are important for all matrix
sizes. Optimization 3 becomes important when the matrices won't fit
in cache. Optimization 4 becomes important when the matrix sizes are
large enough to cause cache conflicts. If you look at the graph
posted on http://www.zib.de/weiser/ublas_review.gif, you will see drop
offs in performance occurring as the matrix size increases. The first
drop-off (at about N=50) is due to not having optimization 3, and the
second (at about N=300) is due to not having optimization 4.

Now, what makes this a touchy process is that different processors
will require different values for the unrolling and jamming and
different values for the cache blocking. Different matrix
orientations can similarly require slightly different algorithms. In
addition, different compilers respond to different expressions of
optimizations in different ways.

The ATLAS effort, for instance, is a code generator that explores the
parameter space of optimizations and picks the best set of
parameters. In MTL, we incorporated these optimizations (via BLAIS
and FAST libraries) as template meta-programs so that different
parameter sets could easily be realized.

If ublas (or any linear algebra library) wants to match vendor-tuned
(or ATLAS-tuned) performance -- it will have to include the
optimizations described here and it will have to be tunable so that
different combinations of optimization parameters can be used on
different architectures.

Best Wishes,
Andrew Lumsdaine

In our last exciting episode Toon Knapen wrote:

> On Thursday 27 June 2002 11:47, Martin Weiser wrote:
> > Dear all,
> >
> > without giving a full review report (I haven't yet used uBLAS sufficiently
> > to post a meaningful statement), I'd like to draw the attention to two
> > points I haven't read so far in the discussion.
> >
> > [ If you're familiar with cache blocking techniques for linear algebra,
> > you may wish to skip right to the end of the posting. ]
> >
> > Although the abstraction penalty is mostly insignificant compared to
> > equivalent handwritten code, there remains a huge performance gap to
> > optimized numerical kernels. I've done a preliminary comparison of dgemm
> > performance with different matrix sizes N (see
> > http://www.zib.de/weiser/ublas_review.gif), measuring the flops/s
> > multiplying two square matrices of size N (N=1,...,1000). The competitors
> > are a vendor optimized BLAS implementation (Sun Performance Library) and
> > a straightforward naive "C" implementation. The environtment is GCC 3.1
> > (options -O3 -funroll-loops) on UltraSparc 10/300.

> This type of graphs would be useful as uBLAS doc !

> I also noticed this performance drop and it was to be expected unless uBLAS
> would contain blocking strategies. To obtain maximal performance, about 8
> months ago, also involving andrew lumsdaine, 2 different approaches have been
> discussed which that can serve to obtain this performance :

> 1) Bindings to BLAS
> Most high-performance computer vendors put a lot of effort into optimising
> BLAS on their machines, writing the kernels directly in assembler. I don't
> think any compiler is ever able to match this performance (first of all, a
> compiler does not know about the different cache sizes that can be exploited).
> This is at least a short-time solution.
> I've also discussed this with Joerg and, as you suggest if I understand you
> correctly, it is very hard to detect in an expression the sub-expression for
> which a blas call can be maid. So the user should state explicitly if he
> wants a blas-gemm or a ublas-prod.
> I've started providing such bindings yesterday and in attachment you can find
> my preliminary bindings.

> 2) optimising the uBLAS implementation
> The review is mostly about the interface. Optimising the C++ implementation
> for the different compilers should be looked at in a following stage IMHO.
> This is very tricky and very subtle (just look at Jeremy's thesis, stating
> the differences using a '==' or '<' condition in for loops) and needs to be
> updated every time a new compiler is released.
> This is a long-term solution.
> I do think it can really coexist with the blas-calls because for small
> matrices it's easier and, if the ublas implementation is heavily optimised,
> long expressions could be quicker (due to the loop-fusion) than several calls
> to blas (and creating temporaries). For instance, when multiplying 3
> matrices, you need 2 gemm-cals and a temporary to store the result of the
> first gemm. Here an optimised ublas should be able to outpace blas (correct
> me if I'm wrong please).

> toon

> #ifndef boost_ublas_blas_hpp
> #define boost_ublas_blas_hpp

> //
> // make sure to match the library symbols
> // taking the right convention to link C with Fortran
> #if defined(__GNUC__) || defined(__ICC)
> #define BLAS_SAXPY saxpy_
> #define BLAS_DAXPY daxpy_
> #define BLAS_CAXPY caxpy_
> #define BLAS_ZAXPY zaxpy_

> #define BLAS_DDOT ddot_
> #define BLAS_ZDOT zdotc_

> #define BLAS_DGEMM dgemm_
> #define BLAS_ZGEMM zgemm_
> #else
> #error do not how to link with fortran
> #endif
>
> //
> // define C-wrapper for (fortran) BLAS calls
> extern "C"
> {
> void BLAS_SAXPY(const int *n, const float* alpha, const float* x, const int* incx, float* y, const int* incy);
> void BLAS_DAXPY(const int *n, const double* alpha, const double* x, const int* incx, double* y, const int* incy);
> void BLAS_CAXPY(const int *n, const std::complex< float >* alpha, const std::complex< float >* x, const int* incx, std::complex< float >* y, const int* incy);
> void BLAS_ZAXPY(const int *n, const std::complex< double >* alpha, const std::complex< double >* x, const int* incx, std::complex< double >* y, const int* incy);

> double BLAS_DDOT(const int *n, const double *x, const int *incx, const double *y, const int *incy);
> std::complex< double > BLAS_ZDOT(const int *n, const std::complex< double > *x, const int *incx, const std::complex< double > *y, const int *incy);

> void BLAS_DGEMM(const char *transa, const char *transb, const int *m, const int *n, const int *k, const double *alpha, const double *a, const int *lda, const double *b, const int *ldb, const double *beta, double *c, const int *ldc);
> void BLAS_ZGEMM(const char *transa, const char *transb, const int *m, const int *n, const int *k, const std::complex< double > *alpha, const std::complex< double > *a, const int *lda, const std::complex< double > *b, const int *ldb, const std::complex< double > *beta, std::complex< double > *c, const int *ldc);
> }

> namespace boost { namespace numerics {
>
> template < typename T >
> struct blas_traits
> {
> // level 1 types
> typedef void (*axpy_type)(const int *n, const T* alpha, const T* x, const int* incx, T* y, const int* incy);
> typedef T (*dot_type)(const int *n, const T* x, const int* incx, const T* y, const int* incy);

> // level 2 types
>
> // level 3 types
> typedef void (*gemm_type)(const char *transa, const char *transb, const int *m, const int *n, const int *k, const T *alpha, const T *a, const int *lda, const T *b, const int *ldb, const T *beta, T *c, const int *ldc);

> // level 1 functions
> static axpy_type axpy ;
> static dot_type dot ;

> // level 2 functions

> // level 3 functions
> static gemm_type gemm ;
> };
>
> const int one = 1;
> const char TRANS_N = 'N';
> const char TRANS_T = 'T';
> const char TRANS_C = 'C';
>
> template < typename T >
> void axpy(const T& alpha, const vector< T >& x, vector< T >& y)
> {
> assert( x.size() == y.size() );
> const int n = x.size();
> blas_traits< T >::axpy( &n, &alpha, &x(0), &one, &y(0), &one );
> }

> template < typename T >
> T dot(const vector< T >& x, const vector< T >& y)
> {
> assert( x.size() == y.size() );
> const int n = x.size();
> return blas_traits< T >::dot( &n, &x(0), &one, &y(0), &one );
> }

> template < typename T >
> void gemm(const T& alpha, const matrix< T, column_major >& a, const matrix< T, column_major >& b, const T& beta, matrix< T, column_major >& c)
> {
> assert( a.size1() == c.size1() ); // m
> assert( b.size2() == c.size2() ); // n
> assert( a.size2() == b.size1() ); // k
> const int m = a.size1();
> const int n = b.size2();
> const int k = a.size2();
> const int lda = m;
> const int ldb = k;
> const int ldc = m;
> blas_traits< T >::gemm( &TRANS_N, &TRANS_N, &m, &n, &k, &alpha, &a(0,0), &lda, &b(0,0), &ldb, &beta, &c(0,0), &ldc );
> }

> template < typename T >
> void gemm(const T& alpha,
> const matrix< T, column_major >& a,
> const typename matrix_unary2_traits< matrix< T, column_major >, scalar_identity< T > >::result_type & b,
> const T& beta,
> matrix< T, column_major >& c)
> {
> std::cout << "gemm a trans(b)" << std::endl;
> std::cout << "addres : " << &b(0,0) << std::endl;
> assert( a.size1() == c.size1() ); // m
> assert( b.size2() == c.size2() ); // n
> assert( a.size2() == b.size1() ); // k
> const int m = a.size1();
> const int n = b.size2();
> const int k = a.size2();
> const int lda = m;
> const int ldb = n;
> const int ldc = m;
> blas_traits< T >::gemm( &TRANS_N, &TRANS_T, &m, &n, &k, &alpha, &a(0,0), &lda, &b(0,0), &ldb, &beta, &c(0,0), &ldc );
> }
> }}

> #endif // boost_ublas_blas_hpp