Please note that I would vote against explicit vectorization atm.
afaik is not related to boost.simd. We can discuss about it then.
Also if you have access to an intel compiler run a test. You will be
vectorization.
As a side note in GSOC 2013 Sathyam had developed an aligned allocator.
now. This will need some modifications since compilers now have better
support for alignment.
On 01/27/2016 10:41 AM, Michael Lehn wrote:
>> - It is row major, as I can think easier that way. So it needs a different packing routine.
> Having specialised packing routines for a certain micro-kernel is ok and not unusual. In general that
> is what you need to achieve the best performance. On some machines the broadcast instruction (filling
> all slots of a vector register) is really slow. In this case it pays off to pack aligned duplicates of the value
> into the puffer. So instead of broadcasting the (possibly) faster mov-instruction can be used.
>
> This was at least the case for the Intel Core 2-Duo. Since Haswell the broadcast is much faster. But
> that’s just an example.
>
> But back to the point: You can consider packing+ugemm being a single component that internally play along.
>
>> - It won't compile on gcc 4.6, as that compiler is unwilling to do vbroadcastsd.
>>
>> - It assumes that the A & B arrays are properly aligned. (gcc won't emit unaligned vector stores for simd arrays)
> That’s also ok. You are packing the blocks the way the ugemm kernel wants them.
>
>> - It is really sensitive to block size. On my old AMD box it come within 90% to Michael's AVX kernel with KC=64, MR=8, & NR = 16, while on my AVX2 box it gets within 70% to Michael's FMA kernel with KC=64, MR=8, & NR=32. Part of the reason for the difference is that I cannot persuade gcc to accumulate to register.
> Yes, if you really want peak performance on different machines you will need more than one micro-kernel. Even
> on a particular machine you sometime have to switch between 4x8, 8x4, 2x16 kernels. And these can be realised
> by different algorithms. But in my opinion these are details. The design should be such, that packing+ugemm is
> a component that easily can be added (just adding a ugemm_myimpl.hpp) somewhere. I think packing+ugemm
> is just a detail that requires “Fleissarbeit” or experts like the BLIS guys. If you can ship a version of uBLAS with
> reasonable good micro-kernels it should be more than good enough for the moment.
>
> More important would be choosing the block sizes MC, KC and NC at runtime depending on the problem size. If
> you look at my benchmarks
>
>
http://apfel.mathematik.uni-ulm.de/~lehn/test_ublas/session4/page01.html#toc5.3>
> You see there is a sharp bend at N=1600 and B=2300. If you increase the number of threads it becomes even
> more obvious. That’s easy to explain as 1600/256 = 6.25 and 2300/256 = 8,98.. While on this machine MC should
> be about 256 it should vary by a multiple of MR such that all blocks have about the same size. Again this is a detail
> and as my code is indented to be simple and suitable for teaching it is ok. But the right thing is simply a function
> that (could) compute MC, KC, NC depending on the element type and matrix dimensions.
>
> Cheers,
>
> Michael