About Blaze: do they have their own implementation of the
matrix-matrix product? They seem to require a tuned BLAS
implementation for it ("Otherwise you get only poor
performance").
I will check the benchmarks I ran. I think I was using MKL with
Blaze, but Blaze takes it a step further (I am not sure how) and
gets better performance than the underlying GEMM. Their
benchmarks indicate that they are faster than MKL:
https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrixmatrix-multiplication
They use a log scale for the benchmarks. IMHO that does not make any sense. On this benchmark they
are only faster for matrix dimensions smaller than 100, and even running the same implementation twice
produces fluctuations of that magnitude. At N = 1000 the results are identical. Outside of the C++ world I have
never seen log scales used for MFLOPS benchmarks. A log scale makes sense when comparing the runtimes of
O(N^k) algorithms, but I don't see the point of it for illustrating performance. All this started with the BTL
(Benchmark Template Library).
But I will look into the Blaze code to make sure (as in prove it).