Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-27 10:41:19
> - It is row major, as I can think easier that way. So it needs a different packing routine.
Having specialised packing routines for a certain micro-kernel is ok and not unusual. In general that
is what you need to achieve the best performance. On some machines the broadcast instruction (filling
all slots of a vector register) is really slow. In this case it pays off to pack aligned duplicates of the value
into the puffer. So instead of broadcasting the (possibly) faster mov-instruction can be used.
This was at least the case for the Intel Core 2-Duo. Since Haswell the broadcast is much faster. But
thats just an example.
But back to the point: You can consider packing+ugemm being a single component that internally play along.
> - It won't compile on gcc 4.6, as that compiler is unwilling to do vbroadcastsd.
> - It assumes that the A & B arrays are properly aligned. (gcc won't emit unaligned vector stores for simd arrays)
Thats also ok. You are packing the blocks the way the ugemm kernel wants them.
> - It is really sensitive to block size. On my old AMD box it come within 90% to Michael's AVX kernel with KC=64, MR=8, & NR = 16, while on my AVX2 box it gets within 70% to Michael's FMA kernel with KC=64, MR=8, & NR=32. Part of the reason for the difference is that I cannot persuade gcc to accumulate to register.
Yes, if you really want peak performance on different machines you will need more than one micro-kernel. Even
on a particular machine you sometime have to switch between 4x8, 8x4, 2x16 kernels. And these can be realised
by different algorithms. But in my opinion these are details. The design should be such, that packing+ugemm is
a component that easily can be added (just adding a ugemm_myimpl.hpp) somewhere. I think packing+ugemm
is just a detail that requires Fleissarbeit or experts like the BLIS guys. If you can ship a version of uBLAS with
reasonable good micro-kernels it should be more than good enough for the moment.
More important would be choosing the block sizes MC, KC and NC at runtime depending on the problem size. If
you look at my benchmarks
You see there is a sharp bend at N=1600 and B=2300. If you increase the number of threads it becomes even
more obvious. Thats easy to explain as 1600/256 = 6.25 and 2300/256 = 8,98.. While on this machine MC should
be about 256 it should vary by a multiple of MR such that all blocks have about the same size. Again this is a detail
and as my code is indented to be simple and suitable for teaching it is ok. But the right thing is simply a function
that (could) compute MC, KC, NC depending on the element type and matrix dimensions.