
Subject: Re: [ublas] Matrix multiplication performance
From: Joaquim Duran Comas (jdurancomas_at_[hidden])
Date: 2016-01-27 19:04:33


If explicit SIMD should not be used for now, then you should help the
compiler generate more optimized code by properly aligning the buffers.

There is the Boost.Align library, which provides an aligned allocator
(http://www.boost.org/doc/libs/1_60_0/doc/html/align.html) and the macro
BOOST_ALIGNMENT(x) to align variables or member variables, which is useful
for bounded and fixed-size containers
(http://www.boost.org/doc/libs/1_59_0/boost/config/suffix.hpp).
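
A minimal sketch of both approaches (the buffer size and the 32-byte
alignment, one AVX register of doubles, are only illustrative values):

    #include <vector>
    #include <boost/align/aligned_allocator.hpp>
    #include <boost/config.hpp>

    // Packing buffer whose storage is aligned to 32 bytes, so the compiler
    // is free to emit aligned vector loads/stores for it.
    std::vector<double,
                boost::alignment::aligned_allocator<double, 32> > packedA(4096);

    // For bounded/fixed-size containers the BOOST_ALIGNMENT macro aligns
    // the variable or member itself.
    struct BOOST_ALIGNMENT(32) MicroPanel {
        double data[256];
    };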

Thanks and Best Regards,
Joaquim Duran

2016-01-27 17:30 GMT+01:00 Nasos Iliopoulos <nasos_i_at_[hidden]>:

> Please note that I would vote against explicit vectorization at the moment.
> We will probably get SIMD in C++17 (
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4184.pdf), which as
> far as I know is not related to Boost.SIMD. We can discuss it then.
>
> Also, if you have access to an Intel compiler, run a test. You will be
> surprised by the performance you get, mainly because of implicit
> vectorization, which in my experience is on par with explicit vectorization.
>
> As a side note, in GSoC 2013 Sathyam developed an aligned allocator. We
> will include it in storage.hpp, but we haven't had the need up to now.
> It will need some modifications since compilers now have better support
> for alignment.
>
>
> https://bitbucket.org/sathyamvellal/ublas/src/49922e961f1eb17a8a66e4ac7de1663e595ea5c9/aligned_allocator/allocator.hpp?at=aligned_allocator&fileviewer=file-view-default
>
>
> -Nasos
>
>
>
> On 01/27/2016 10:41 AM, Michael Lehn wrote:
>
>>> - It is row major, as I find it easier to think that way. So it needs a
>>> different packing routine.
>>>
>> Having specialised packing routines for a certain micro-kernel is ok and
>> not unusual. In general that is what you need to achieve the best
>> performance. On some machines the broadcast instruction (filling all slots
>> of a vector register) is really slow. In this case it pays off to pack
>> aligned duplicates of the value into the buffer. So instead of
>> broadcasting, the (possibly) faster mov instruction can be used.
>>
>> This was at least the case for the Intel Core 2 Duo. Since Haswell the
>> broadcast is much faster. But that's just an example.
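>>
>> A rough sketch of that packing idea (the function name and the
>> duplication factor of 4 slots per value are only illustrative):
>>
>>     #include <cstddef>
>>
>>     // Pack an MR x KC micro-panel of A, writing every element into 4
>>     // consecutive slots so the micro-kernel can use a plain aligned
>>     // vector load instead of a broadcast.
>>     void
>>     pack_A_dup(std::size_t MR, std::size_t KC,
>>                const double *A, std::size_t incRowA, std::size_t incColA,
>>                double *buffer)
>>     {
>>         for (std::size_t l=0; l<KC; ++l) {
>>             for (std::size_t i=0; i<MR; ++i) {
>>                 const double a = A[i*incRowA + l*incColA];
>>                 for (std::size_t d=0; d<4; ++d) {
>>                     *buffer++ = a;
>>                 }
>>             }
>>         }
>>     }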
>>
>> But back to the point: you can consider packing+ugemm to be a single
>> component whose parts work together internally.
>>
>>> - It won't compile on gcc 4.6, as that compiler is unwilling to do
>>> vbroadcastsd.
>>>
>>> - It assumes that the A & B arrays are properly aligned. (gcc won't emit
>>> unaligned vector stores for simd arrays)
>>>
>> That’s also ok. You are packing the blocks the way the ugemm kernel
>> wants them.
>>
>>> - It is really sensitive to block size. On my old AMD box it comes within
>>> 90% of Michael's AVX kernel with KC=64, MR=8, & NR=16, while on my AVX2
>>> box it gets within 70% of Michael's FMA kernel with KC=64, MR=8, & NR=32.
>>> Part of the reason for the difference is that I cannot persuade gcc to
>>> accumulate in registers.
>>>
>> Yes, if you really want peak performance on different machines you will
>> need more than one micro-kernel. Even on a particular machine you
>> sometimes have to switch between 4x8, 8x4 and 2x16 kernels. And these can
>> be realised by different algorithms. But in my opinion these are details.
>> The design should be such that packing+ugemm is a component that can
>> easily be added (just by adding a ugemm_myimpl.hpp somewhere). I think
>> packing+ugemm is just a detail that requires “Fleissarbeit” (diligent
>> legwork) or experts like the BLIS guys. If you can ship a version of
>> uBLAS with reasonably good micro-kernels it should be more than good
>> enough for the moment.
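>>
>> As an illustration of what such a pluggable component could provide (the
>> signature below is only a guess at a possible interface, with a plain
>> reference implementation standing in for a tuned kernel):
>>
>>     #include <cstddef>
>>
>>     // Hypothetical micro-kernel entry point: C := beta*C + alpha*A*B,
>>     // where A is a packed MR x kc panel (stored column by column) and
>>     // B is a packed kc x NR panel (stored row by row).
>>     template <std::size_t MR, std::size_t NR, typename T>
>>     void
>>     ugemm_ref(std::size_t kc, T alpha, const T *A, const T *B,
>>               T beta, T *C, std::size_t incRowC, std::size_t incColC)
>>     {
>>         T AB[MR*NR] = {};                       // local accumulator
>>         for (std::size_t l=0; l<kc; ++l) {
>>             for (std::size_t i=0; i<MR; ++i) {
>>                 for (std::size_t j=0; j<NR; ++j) {
>>                     AB[i*NR+j] += A[l*MR+i] * B[l*NR+j];
>>                 }
>>             }
>>         }
>>         for (std::size_t i=0; i<MR; ++i) {
>>             for (std::size_t j=0; j<NR; ++j) {
>>                 C[i*incRowC+j*incColC] = beta*C[i*incRowC+j*incColC]
>>                                        + alpha*AB[i*NR+j];
>>             }
>>         }
>>     }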
>>
>> More important would be choosing the block sizes MC, KC and NC at runtime
>> depending on the problem size. If
>> you look at my benchmarks
>>
>>
>> http://apfel.mathematik.uni-ulm.de/~lehn/test_ublas/session4/page01.html#toc5.3
>>
>> You can see there is a sharp bend at N=1600 and N=2300. If you increase
>> the number of threads it becomes even more obvious. That's easy to
>> explain, as 1600/256 = 6.25 and 2300/256 = 8.98. While on this machine MC
>> should be about 256, it should vary in multiples of MR such that all
>> blocks have about the same size. Again, this is a detail, and as my code
>> is intended to be simple and suitable for teaching it is ok. But the
>> right thing would simply be a function that computes MC, KC and NC
>> depending on the element type and the matrix dimensions.
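>>
>> As a rough illustration (the rounding strategy and the default of 256
>> below are only an assumption based on the numbers above), such a function
>> for MC could look like:
>>
>>     #include <cstddef>
>>
>>     // Choose MC: start from a cache-derived default, but split the M
>>     // dimension into blocks of roughly equal size, each rounded up to a
>>     // multiple of MR.
>>     std::size_t
>>     compute_MC(std::size_t M, std::size_t MR, std::size_t MC_default = 256)
>>     {
>>         if (M <= MC_default) {
>>             return ((M + MR - 1) / MR) * MR;      // single block
>>         }
>>         std::size_t numBlocks = (M + MC_default - 1) / MC_default;
>>         std::size_t mc        = (M + numBlocks - 1) / numBlocks;
>>         return ((mc + MR - 1) / MR) * MR;         // multiple of MR
>>     }
>>
>> KC and NC could be chosen analogously, additionally taking the size of
>> the element type into account.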
>>
>> Cheers,
>>
>> Michael
>>
>>
>
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/ublas
> Sent to: jdurancomas_at_[hidden]
>