
Subject: Re: [ublas] Matrix multiplication performance
From: Nasos Iliopoulos (nasos_i_at_[hidden])
Date: 2016-02-10 09:49:04


Can't we just use the Boost aligned allocator instead of using malloc?
It would be much safer and would be seen as much better practice.
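
For reference, a minimal sketch of what that could look like with
Boost.Align (assuming Boost >= 1.56, where the library was introduced):

    #include <boost/align/aligned_alloc.hpp>      // aligned_alloc, aligned_free
    #include <boost/align/aligned_allocator.hpp>  // aligned_allocator
    #include <vector>

    int main()
    {
        // malloc-style: a matching alloc/free pair, alignment a power of two
        void *p = boost::alignment::aligned_alloc(64, 1024);
        if (p) {
            // ... use the 64-byte aligned buffer ...
            boost::alignment::aligned_free(p);
        }

        // allocator-style: plugs directly into standard containers
        std::vector<double,
                    boost::alignment::aligned_allocator<double, 64> > v(256);
        // v.data() is now 64-byte aligned
    }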

On 02/10/2016 09:39 AM, Michael Lehn wrote:
> Hi Joaquim,
>
> I don’t think that this works. When you release memory you have to call free(..) with the pointer
> you received from malloc. This pointer to the originally allocated memory gets stored “behind” the
> pointer returned by aligned-malloc. Of course this requires a matching aligned-free.
>
> So if the “usual” malloc gives you properly aligned memory, that is in most cases just luck. Assume
> storing a pointer takes 8 bytes, you want 16-byte alignment, and you allocate n bytes. In this case
> (16+n) bytes get allocated. If your “usual” malloc gives you address 128 for these (16+n) bytes,
> then:
>
> 1) bytes at 128, 129, 130, 131, 132, 133, 134, 135 are not used
> 2) bytes at 136, 137, 138, 139, 140, 141, 142, 143 are used to store the original address from malloc (here 128)
> 3) the aligned-malloc returns 144 and you can use the n-bytes at 144, 145, …., 144+n-1
>
> If you call aligned-free with 144, it internally calls free with the address stored in the bytes at (136, …, 143).
>
> IMHO this is also what boost::alignment::aligned_alloc and boost::alignment::aligned_free do.
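>
> For completeness, here is a minimal sketch of such a matching pair (just
> the scheme described above, not necessarily the exact Boost implementation):
>
>     #include <algorithm>   // std::max
>     #include <cstdint>     // uintptr_t
>     #include <cstdlib>     // std::malloc, std::free
>
>     void*
>     malloc_(std::size_t alignment, std::size_t size)
>     {
>         alignment = std::max(alignment, alignof(void*));
>         size += alignment;                // slack for padding + stored pointer
>
>         void *ptr = std::malloc(size);    // e.g. returns 128
>         if (!ptr) {
>             return nullptr;
>         }
>         void *ptr2 = (void *)(((uintptr_t)ptr + alignment) & ~(alignment-1));
>                                           // e.g. 144 for 16-byte alignment
>         void **vp = (void**) ptr2 - 1;    // bytes 136..143
>         *vp = ptr;                        // remember what malloc returned
>         return ptr2;
>     }
>
>     void
>     free_(void *ptr2)
>     {
>         if (ptr2) {
>             std::free(*((void**) ptr2 - 1));   // free the original pointer
>         }
>     }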
>
> Cheers,
>
> Michael
>
>
> On 03 Feb 2016, at 00:56, Joaquim Duran Comas <jdurancomas_at_[hidden]> wrote:
>
>> Michael,
>>
>> I have a comment about the function malloc_: it is possible to reduce the memory allocated by exactly 1 byte per call. The change to implement is:
>>
>> void*
>> malloc_(std::size_t alignment, std::size_t size)
>> {
>>     alignment = std::max(alignment, alignof(void*));
>>     size += alignment - 1;   // this is the change: introduce -1
>>
>>     void *ptr  = std::malloc(size);
>>     void *ptr2 = (void *)(((uintptr_t)ptr + alignment) & ~(alignment-1));
>>     void **vp  = (void**) ptr2 - 1;
>>     *vp = ptr;
>>     return ptr2;
>> }
>>
>> The reason for this change is that, to make sure an aligned address is available, it is not
>> necessary to add the full alignment to size: the first address of the allocated block is itself
>> a candidate for the aligned address.
>> As an example, let's assume we want to allocate 1 byte of data aligned to 8 bytes (ignoring the
>> space for the stored ptr address for simplicity) and that the first memory address is 0. With the
>> original code, 9 bytes are allocated (1 + 8), addresses 0 through 8, which includes two 8-byte
>> aligned addresses (0 and 8). With the proposed change, 8 bytes are allocated (1 + 8 - 1),
>> addresses 0 through 7, which includes only one 8-byte aligned address (0).
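>>
>> In code form, the rounding this example assumes is the "round up to the
>> next boundary" variant (a sketch; note that the posted malloc_ instead
>> computes (ptr + alignment) & ~(alignment-1), which always skips at least
>> one byte):
>>
>>     #include <cstddef>   // std::size_t
>>     #include <cstdint>   // uintptr_t
>>
>>     // round p up to the next multiple of alignment (a power of two);
>>     // aligned_up(0, 8) == 0, so address 0 itself stays usable
>>     uintptr_t aligned_up(uintptr_t p, std::size_t alignment)
>>     {
>>         return (p + alignment - 1) & ~(uintptr_t)(alignment - 1);
>>     }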
>> Joaquim Duran
>>
>>
>> 2016-02-02 23:26 GMT+01:00 Michael Lehn <michael.lehn_at_[hidden]>:
>> I also started to write down some notes on the rationale behind the micro-kernel:
>>
>> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session6/page01.html
>>
>>
>> On 02 Feb 2016, at 21:43, Michael Lehn <michael.lehn_at_[hidden]> wrote:
>>
>>> (So the pages are re-done …)
>>>
>>>
>>>
>>> The bug is that NR here is BlockSize<T>::NR divided by the vector length, so the final
>>> update loop has to run over all NR*div scalar columns (and index P_ accordingly). So this
>>> should work:
>>>
>>> //---------------------------------------------------------------------------------
>>> template <typename Index, typename T>
>>> typename std::enable_if<std::is_floating_point<T>::value,
>>>                         void>::type
>>> ugemm(Index kc, T alpha,
>>>       const T *A, const T *B,
>>>       T beta,
>>>       T *C, Index incRowC, Index incColC)
>>> {
>>>     A = (const T*) __builtin_assume_aligned (A, 128);
>>>     B = (const T*) __builtin_assume_aligned (B, 128);
>>>     static const unsigned div = 32/sizeof(T);          // scalars per vector register
>>>     static const Index    MR  = BlockSize<T>::MR;
>>>     static const Index    NR  = BlockSize<T>::NR/div;  // NR counted in vectors
>>>
>>>     typedef T vx __attribute__((vector_size (32)));    // GCC vector extension type
>>>
>>>     vx P[MR*NR] __attribute__ ((aligned (128))) = {};  // accumulator micro-tile
>>>     const vx *B_ = (vx *)B;
>>>     for (Index l=0; l<kc; ++l) {
>>>         for (Index i=0; i<MR; ++i) {
>>>             for (Index j=0; j<NR; ++j) {
>>>                 P[i*NR+j] += A[i]*B_[j];
>>>             }
>>>         }
>>>         A  += MR;
>>>         B_ += NR;
>>>     }
>>>
>>>     // update C; the loop now runs over all NR*div scalar columns of P
>>>     T *P_ = (T *)P;
>>>     for (Index j=0; j<NR*div; ++j) {
>>>         for (Index i=0; i<MR; ++i) {
>>>             C[i*incRowC+j*incColC] *= beta;
>>>             C[i*incRowC+j*incColC] += alpha*P_[i*NR*div+j];
>>>         }
>>>     }
>>> }
>>> //---------------------------------------------------------------------------------
>>>
>>> However, as Karl pointed out, it also should treat the case beta==0
>>> as a special case.
>>>
>>> But I don’t get any speedup. The performance-relevant part is
>>> mainly the loop
>>>
>>> for (Index l=0; l<kc; ++l) {
>>>     for (Index i=0; i<MR; ++i) {
>>>         for (Index j=0; j<NR; ++j) {
>>>             P[i*NR+j] += A[i]*B_[j];
>>>         }
>>>     }
>>>     A  += MR;
>>>     B_ += NR;
>>> }
>>>
>>> From my experience, only once this loop is optimized to the extreme can a
>>> little bit of extra performance be gained by treating some special cases
>>> for the update (a beta==0 sketch follows the loop):
>>>
>>> T *P_ = (T *)P;
>>> for (Index j=0; j<NR*div; ++j) {
>>>     for (Index i=0; i<MR; ++i) {
>>>         C[i*incRowC+j*incColC] *= beta;
>>>         C[i*incRowC+j*incColC] += alpha*P_[i*NR*div+j];
>>>     }
>>> }
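>>>
>>> For example, the beta==0 case Karl mentioned could look like this (a
>>> sketch, not tuned; it also avoids reading a possibly uninitialized C):
>>>
>>>     T *P_ = (T *)P;
>>>     if (beta==T(0)) {
>>>         for (Index j=0; j<NR*div; ++j) {
>>>             for (Index i=0; i<MR; ++i) {
>>>                 C[i*incRowC+j*incColC] = alpha*P_[i*NR*div+j];
>>>             }
>>>         }
>>>     } else {
>>>         for (Index j=0; j<NR*div; ++j) {
>>>             for (Index i=0; i<MR; ++i) {
>>>                 C[i*incRowC+j*incColC] *= beta;
>>>                 C[i*incRowC+j*incColC] += alpha*P_[i*NR*div+j];
>>>             }
>>>         }
>>>     }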
>>>
>>> The main reason my code looks weird is that I also want to use it
>>> for teaching. The way it is coded, you can easily map it back and
>>> forth to the assembly code.
>>>
>>>
>>>> new kernel:
>>>>
>>>> template <typename Index, typename T>
>>>> typename std::enable_if<std::is_floating_point<T>::value,
>>>>                         void>::type
>>>> ugemm(Index kc, T alpha,
>>>>       const T *A, const T *B,
>>>>       T beta,
>>>>       T *C, Index incRowC, Index incColC)
>>>> {
>>>>     A = (const T*) __builtin_assume_aligned (A, 128);
>>>>     B = (const T*) __builtin_assume_aligned (B, 128);
>>>>     static const unsigned div = 32/sizeof(T);
>>>>     static const Index    MR  = BlockSize<T>::MR;
>>>>     static const Index    NR  = BlockSize<T>::NR/div;
>>>>
>>>>     typedef T vx __attribute__((vector_size (32)));
>>>>
>>>>     vx P[MR*NR] __attribute__ ((aligned (128))) = {};
>>>>     const vx *B_ = (vx *)B;
>>>>     for (Index l=0; l<kc; ++l) {
>>>>         for (Index i=0; i<MR; ++i) {
>>>>             for (Index j=0; j<NR; ++j) {
>>>>                 P[i*NR+j] += A[i]*B_[j];
>>>>             }
>>>>         }
>>>>         A  += MR;
>>>>         B_ += NR;
>>>>     }
>>>>
>>>>     // bug (fixed above): NR is already divided by the vector length, so
>>>>     // this update only touches the first NR of the NR*div scalar columns
>>>>     T *P_ = (T *)P;
>>>>     for (Index j=0; j<NR; ++j) {
>>>>         for (Index i=0; i<MR; ++i) {
>>>>             C[i*incRowC+j*incColC] *= beta;
>>>>             C[i*incRowC+j*incColC] += alpha*P_[i*NR+j];
>>>>         }
>>>>     }
>>>> }