Hi All,

What's next? I mean, what is the development process for uBLAS?
We now have a C-like implementation that outperforms both the
mainline and the branch version (axpy_prod). What will we do
with it?
As far as I can see, we have the following options:
1)
Create a C++ template-magic implementation out of it.
But for this we would at least need compile-time access
to the target instruction set. Any idea how to do that?
2)
Create a compiled-library implementation out of it, and
choose the implementation at run time based on the CPU
capabilities. (A sketch of both this and option 1 follows
the list.)
3)
Include some good defaults/defines, and hope the user
will use them.
4) Don't include it, and do something
completely different.
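
For option 1, the compiler's predefined macros (GCC/Clang set
__AVX__, __SSE3__, ... according to the -m flags) give compile-time
access to the target instruction set, and for option 2 GCC and Clang
offer run-time CPU feature queries on x86. Here is a minimal sketch
of both styles; the gemm_* kernels are hypothetical stand-ins, not
anything from the current patch:

    #include <cstdio>

    // Hypothetical kernels standing in for micro kernels built per ISA.
    void gemm_avx()     { std::puts("AVX kernel"); }
    void gemm_sse3()    { std::puts("SSE3 kernel"); }
    void gemm_generic() { std::puts("generic kernel"); }

    using gemm_fn = void (*)();

    // Option 1: compile-time selection via predefined compiler macros.
    #if defined(__AVX__)
    constexpr gemm_fn gemm_static = gemm_avx;
    #elif defined(__SSE3__)
    constexpr gemm_fn gemm_static = gemm_sse3;
    #else
    constexpr gemm_fn gemm_static = gemm_generic;
    #endif

    // Option 2: run-time selection based on CPU capabilities
    // (__builtin_cpu_supports is a GCC/Clang builtin, x86 only).
    gemm_fn select_gemm()
    {
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx"))  return gemm_avx;
        if (__builtin_cpu_supports("sse3")) return gemm_sse3;
        return gemm_generic;
    }

    int main()
    {
        gemm_static();    // picked when the library is compiled
        select_gemm()();  // picked on the machine that runs it
    }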
What do you
think?
At the moment I am not 100% sure, but pretty sure, that you
don’t have to rewrite uBLAS to get good performance.
uBLAS is a pretty fine
library and already has what you need. So just extend it.
At least, it only took 383 lines of code :-)
As I was pretty busy the last few days, I could not continue
until today. I made some minimal modifications to adapt the
GEMM implementation to take advantage of uBLAS:
- The GEMM frame algorithm and the pack functions now work
with any uBLAS matrix that supports element access through the
notation A(i,j)
- Only matrix C in the operation C <- beta*C + alpha*A*B
is required to be stored row-major or col-major. The other
matrices can be expressions. Here I need some help to get
rid of these ugly lines of code (see the first sketch after
this list):

    TC *C_ = &C(0,0);
    const size_type incRowC = &C(1,0) - &C(0,0);
    const size_type incColC = &C(0,1) - &C(0,0);
- Things like the following should work without performance
penalty (and in particular without creating temporaries); a
fuller usage sketch follows the list:

    blas::axpy_prod(3*A1 + A2, B1 - B2, C, matprodUpdate)
- And matrices A and B can be symmetric or whatever. Of
course, if A or B is triangular, a specialized implementation
can take advantage of that.
- The storage format does not matter. You can mix
row-major and col-major, packed, …
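
On the stride question above: one possible cleanup is to overload on
the matrix's layout type, so the pointer comes from data() and the
increments from the extents instead of from element-address
arithmetic. This is only a sketch against plain ublas::matrix; the
helper name is made up, and other dense types would need their own
overloads:

    #include <cstddef>
    #include <boost/numeric/ublas/matrix.hpp>

    namespace ublas = boost::numeric::ublas;

    // Row-major: element (i,j) lives at i*size2() + j in the storage
    // array, so consecutive rows are size2() elements apart.
    template <typename T, typename A>
    T* leading_pointer(ublas::matrix<T, ublas::row_major, A> &C,
                       std::ptrdiff_t &incRow, std::ptrdiff_t &incCol)
    {
        incRow = static_cast<std::ptrdiff_t>(C.size2());
        incCol = 1;
        return &C.data()[0];  // data() exposes the dense storage array
    }

    // Col-major: element (i,j) lives at i + j*size1().
    template <typename T, typename A>
    T* leading_pointer(ublas::matrix<T, ublas::column_major, A> &C,
                       std::ptrdiff_t &incRow, std::ptrdiff_t &incCol)
    {
        incRow = 1;
        incCol = static_cast<std::ptrdiff_t>(C.size1());
        return &C.data()[0];
    }

The three ugly lines in the frame algorithm would then collapse to
something like: TC *C_ = leading_pointer(C, incRowC, incColC);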
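
And to make the expression/storage claims concrete, here is a small
self-contained driver. It uses the stock four-argument axpy_prod from
<boost/numeric/ublas/operation.hpp> with the bool init flag; the
patched version from this post would take the matprodUpdate functor
in that position instead:

    #include <cstddef>
    #include <boost/numeric/ublas/matrix.hpp>
    #include <boost/numeric/ublas/operation.hpp>

    namespace ublas = boost::numeric::ublas;

    int main()
    {
        const std::size_t m = 256, k = 256, n = 256;
        ublas::matrix<double>                      A1(m, k), A2(m, k);
        ublas::matrix<double, ublas::column_major> B1(k, n), B2(k, n);
        ublas::matrix<double>                      C(m, n);

        // Expression arguments and mixed storage orders in one call:
        // A1, A2 are row-major, B1, B2 are col-major.
        ublas::axpy_prod(3*A1 + A2, B1 - B2, C, true);
    }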
Here is the page for this:
Please note, creating the benchmarks for the symmetric
matrix-matrix product takes really long because the current
uBLAS implementation seems to be much slower for symmetric
matrices than for general ones. So I reduced the problem size
for now. It would be helpful if some people could reproduce
the benchmarks and also check different cases:
- different expressions
- different element types, e.g. A with float, B with
double, etc. At the moment there is only a micro kernel for
double. The performance depends on the common type of A and
B, so with complex<..> the reference implementation gets
used. But the performance should at least be constant in
this case. (A small example follows below.)
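
For instance, a mixed-precision case could look like this (again
with the stock axpy_prod signature; the sizes are arbitrary). The
common type of float and double is double, so this should hit the
double micro kernel:

    #include <cstddef>
    #include <boost/numeric/ublas/matrix.hpp>
    #include <boost/numeric/ublas/operation.hpp>

    namespace ublas = boost::numeric::ublas;

    int main()
    {
        const std::size_t m = 256, k = 256, n = 256;
        ublas::matrix<float>  A(m, k);  // single precision input
        ublas::matrix<double> B(k, n);  // double precision input
        ublas::matrix<double> C(m, n);

        // common type of float and double is double, so the double
        // micro kernel should be selected by the patch.
        ublas::axpy_prod(A, B, C, true);
    }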
It would in particular be nice to have benchmarks from
people with an Intel Sandy Bridge or Haswell, as the micro
kernels are optimized for these architectures. If there is
interest, I can extend the benchmarks to compare with Intel
MKL. For Linux there is a free non-commercial version
available.
Cheers,
Michael