Hi Michael,

as for the block size, you could choose it somewhat better.

Both MR and NR should be at least the size of a L1 cacheline (i.e., 16 for double and 32 for float on Intel), moreover you should try to align your blocks to L1 cacheline (e.g. using posix_memalign(), or the cache aligned allocator from TBB)

This alone brought something like a 50% increase in performance in my tests.  This also means that the original implementation stops being competitive around at around 50*50 squares, as opposed to 75*75.

I have some ideas about the other parameters too + I will plan to have a look with perf during the weekend.

Dynamic vs. static block size: A wild guess would be that dynamic probably worths to do above a different size.  Should measure though.

Cheers,

Imre


On Friday, 22 January 2016, 0:30, "ublas-request@lists.boost.org" <ublas-request@lists.boost.org> wrote:
From: Michael Lehn <michael.lehn@uni-ulm.de>
To: ublas mailing list <ublas@lists.boost.org>
Subject: Re: [ublas] Matrix multiplication performance
Message-ID: <7FA49567-4580-4C7F-9B9E-43E08E78E14B@uni-ulm.de>
Content-Type: text/plain; charset="windows-1252"

Hi Nasos,

first of all I don?t want to take wrong credits and want to point out that this is not my algorithm.  It is based on

    http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf

    https://github.com/flame/blis

For a few cores (4-8) it can easily made multithreaded.  For many-cores like Intel Xeon Phi this is a bit more
sophisticated but still not too hard.  The demo I posted does not use micro kernels that exploit SSE, AVX or
FMA instructions.  With that the matrix product is on par with Intel MKL.  Just like BLIS. For my platforms I wrote
my own micro-kernels but the interface of function ugemm is compatible to BLIS.

Maybe you could help me to integrate your code in the benchmark example I posted above.

About Blaze:  Do they have their own implementation of a matrix-matrix product?  It seems to require a
tuned BLAS implementation (?Otherwise you get only poor performance?) for the matrix-matrix product.
IMHO they only have tuned the ?easy? stuff like BLAS Level1 and Level2.  In that case it makes more
sense to compare the performance with the actual underlying GEMM implementation.  But if I am wrong,
let me know.

About the block size: In my experience you get better performance if you chose them dynamically at runtime
depending on the problem size.  Depending on the architecture you can just specify ranges like 256 - 384 for
blocking factor MC.  In my code it then also needs to satisfy the restriction that it can be divided by factor MR.
I know that doing things at compile time is a C++ fetish.  But the runtime overhead is negligible and having
blocks of similar sizes easily pays of.

Cheers,

Michael