Hi,
I like your ideas, but I am not sure what you mean by 1. and 2.
The problem I see is that uBLAS right now doesn't offer the
desired quality and speed on a single core, let alone on multiple
cores. Therefore I will focus on the internal infrastructure
instead of proposing new algorithms.
Short description:
1. Rock-solid implementation of the BLAS algorithms ("compute
kernels").
2. Rewrite of the expression mechanics.
2.1 Automatic use of the compute kernels for the operations.
2.2 Automatic optimisation of expressions.
Long description:
1. Rock-solid implementation of the BLAS algorithms: matrix x
vector and matrix x matrix in all useful combinations
(sparse/sparse, dense/dense, dense/sparse,
triangular/dense, etc.; there are a lot), but also matrix
assignment. uBLAS will not be judged by its number of features
but by its runtime compared to Armadillo/Eigen/ATLAS/GotoBLAS.
Having a fast implementation of these kernels is also crucial for
multi-core and distributed computations. Moreover, all
higher-level decompositions (Cholesky, LU, SVD, rank
decompositions, ...) and triangular solvers rely on fast BLAS.
2. Rewrite of the expression mechanics. The way uBLAS is
implemented right now, writing

noalias(A) += prod(B, C) + D;

or

vec = prod(prod<matrix>(B, C), vec2);

is in most cases a bad decision. The second expression would not
even compile without the explicit template parameter.
What I would like to have is the following:
2.1 Automatic use of the fast algorithms. That is, the first
expression above should be translated to

axpy_prod(B, C, A, false);
noalias(A) += D;

Right now the matrix_assign algorithms are used, which do a
terrible job on prod(B, C).
2.2 Automatic optimisation of expressions. The second expression
above is really inefficient: first the matrix-matrix product is
evaluated, and the result is then used for a single matrix-vector
product. Now compare this to

vec = prod(B, prod<vector>(C, vec2));

Only two matrix-vector products are performed, which might save
99% of the FLOPs of the first version. So it would be better if
this transformation were applied automatically. Normally the user
could do it himself, but sometimes that is not possible. Consider
a conjugate gradient solver:

x = CG(A, b);

A usual pattern for A is A = X*X^T + c*Identity. Internally, CG
uses only matrix-vector products, and possibly not many of them.
Therefore you don't want to compute A explicitly, because a) it
is slow and b) A might require more memory than is available.
Without expression optimisation it would be impossible to write a
good CG that could exploit that structure.
Greetings,
Oswin
On 07.12.2013 11:38, David Bellot wrote:
_______________________________________________
ublas mailing list
ublas@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: Oswin.Krause@ruhr-uni-bochum.de