Of course, I just didn’t know about the Boost aligned allocator when I coded the example.  Also, please consider
these codes just as prototypes.  There are many places where you can see that the original implementation
comes from a very “puristic C-style” C++ implementation ...


On 10 Feb 2016, at 15:49, Nasos Iliopoulos <nasos_i@hotmail.com> wrote:

Can't we just use the Boost aligned allocator instead of using malloc? It would be much safer and is generally seen as much better practice.
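
Something along these lines is what I have in mind (just a sketch, assuming Boost.Align is available; the sizes below are made up):

#include <vector>
#include <boost/align/aligned_alloc.hpp>      // boost::alignment::aligned_alloc / aligned_free
#include <boost/align/aligned_allocator.hpp>  // boost::alignment::aligned_allocator

int main()
{
    // C-style interface: a matching alloc/free pair, much like the hand-rolled malloc_
    void *p = boost::alignment::aligned_alloc(32, 1024);   // 1024 bytes, 32-byte aligned
    // ... use p ...
    boost::alignment::aligned_free(p);

    // Or let a standard container allocate with the required alignment
    std::vector<double, boost::alignment::aligned_allocator<double, 32>> buf(128);
    return 0;
}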

On 02/10/2016 09:39 AM, Michael Lehn wrote:
Hi Joaquim,

I don’t think that this works.  When you release memory you have to call free(..) with the pointer
you received from malloc.  The pointer to the originally allocated memory gets stored “behind” the
pointer returned by aligned-malloc.  Of course this requires a matching aligned-free.

So if the “usual” malloc happens to give you properly aligned memory, that is in most cases just luck.  Assume
that storing a pointer takes 8 bytes, you want 16-byte alignment and allocate n bytes.  In this case
(16+n) bytes get allocated.  If your “usual” malloc gives you address 128 for these (16+n) bytes,
then:

1) bytes at 128, 129, 130, 131, 132, 133, 134, 135 are not used
2) bytes at 136, 137, 138, 139, 140, 141, 142, 143 are used to store the original  address from malloc (here 128)
3) the aligned-malloc returns 144 and you can use the n-bytes at 144, 145, …., 144+n-1

If you call aligned-free with 144, it internally calls free with the address stored in the bytes at (136, …, 143).
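
Just to sketch what such a matching free looks like (the name free_ is made up here, this is not literally the code):

#include <cstdlib>

// Sketch of a matching aligned free: the original pointer from std::malloc
// sits in the sizeof(void*) bytes directly in front of the aligned pointer
// handed out by the aligned malloc.
void
free_(void *ptr2)
{
    if (ptr2) {
        void **vp = (void **) ptr2 - 1;   // slot holding the original malloc pointer
        std::free(*vp);                   // e.g. frees 128 when called with 144
    }
}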

IMHO this is also what boost::alignment::aligned_alloc and boost::alignment::aligned_free do.

Cheers,

Michael


On 03 Feb 2016, at 00:56, Joaquim Duran Comas <jdurancomas@gmail.com> wrote:

Michael,

I have a comment about the function malloc_: it is possible to reduce the memory allocated by exactly 1 byte per call.  The change to implement is:

void*
malloc_(std::size_t alignment, std::size_t size)
{
    alignment = std::max(alignment, alignof(void*));
    size     += alignment - 1;  // this is the change: introduce -1

    void *ptr  = std::malloc(size);
    void *ptr2 = (void *)(((uintptr_t)ptr + alignment) & ~(alignment-1));
    void **vp  = (void**) ptr2 - 1;
    *vp        = ptr;
    return ptr2;
}

The reason for this change is that, in order to guarantee an aligned address, it is not necessary to add the full alignment to size, because the first address of the allocated block is itself a candidate.
As an example, let's assume that we want to allocate 1 byte of data aligned at 8 (ignoring the space for storing the ptr address, for simplicity) and that the first memory address is 0.  Based on the original code, 9 bytes would be allocated (1 + 8), i.e. addresses 0 up to 8, which includes two 8-byte-aligned addresses (0 and 8).  With the proposed change, 8 bytes would be allocated (1 + 8 - 1), i.e. addresses 0 up to 7, which includes only one 8-byte-aligned address (address 0).
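
Just to make the address arithmetic concrete, here is a small standalone check of the rounding formula used in malloc_ (the sample addresses are made up and only serve as illustration):

#include <cstdint>
#include <cstdio>

int main()
{
    const std::uintptr_t alignment = 8;
    const std::uintptr_t samples[] = {0, 3, 8, 13};   // hypothetical malloc return values

    for (std::uintptr_t ptr : samples) {
        // same rounding as in malloc_: smallest multiple of alignment strictly above ptr
        std::uintptr_t ptr2 = (ptr + alignment) & ~(alignment - 1);
        std::printf("malloc: %2ju  aligned: %2ju  offset: %ju\n",
                    (std::uintmax_t)ptr, (std::uintmax_t)ptr2,
                    (std::uintmax_t)(ptr2 - ptr));
    }
    return 0;
}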
Joaquim Duran


2016-02-02 23:26 GMT+01:00 Michael Lehn <michael.lehn@uni-ulm.de>:
I also started to write down some notes on the rationale behind the micro-kernel:

http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session6/page01.html


On 02 Feb 2016, at 21:43, Michael Lehn <michael.lehn@uni-ulm.de> wrote:

(So the pages are re-done …)



The bug comes from the fact that NR here is BlockSize<T>::NR divided by the vector length.  So this
should work:

//---------------------------------------------------------------------------------
template <typename Index, typename T>
typename std::enable_if<std::is_floating_point<T>::value,
void>::type
ugemm(Index kc, T alpha,
      const T *A, const T *B,
      T beta,
      T *C, Index incRowC, Index incColC)
{
   A = (const T*) __builtin_assume_aligned (A, 128);
   B = (const T*) __builtin_assume_aligned (B, 128);
   static const unsigned div = 32/sizeof(T);
   static const Index MR = BlockSize<T>::MR;
   static const Index NR = BlockSize<T>::NR/div;

   typedef T vx __attribute__((vector_size (32)));

   vx P[MR*NR] __attribute__ ((aligned (128))) = {};
   const vx *B_ = (vx *)B;
   for (Index l=0; l<kc; ++l) {
       for (Index i=0; i<MR; ++i) {
           for (Index j=0; j<(NR); ++j) {
               P[i * NR + j] += A[i]*B_[j];
           }
       }
       A += MR;
       B_ += NR;
   }

   T *P_ = (T *)P;
   for (Index j=0; j<NR*div; ++j) {
       for (Index i=0; i<MR; ++i) {
           C[i*incRowC+j*incColC] *= beta;
           C[i*incRowC+j*incColC] += alpha*P_[i * NR * div + j];
       }
   }
}
//---------------------------------------------------------------------------------

However, as Karl pointed out, it should also treat the case beta==0
as a special case.
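
For reference, one way to special-case beta==0 in the write-back would be something like the following (just a sketch, reusing P, MR, NR, div etc. from the kernel above; it is not how the code currently reads):

   // Sketch only: separate beta==0 branch, so that C is overwritten
   // instead of scaled (and never read) when beta is zero.
   T *P_ = (T *)P;
   if (beta==T(0)) {
       for (Index j=0; j<NR*div; ++j) {
           for (Index i=0; i<MR; ++i) {
               C[i*incRowC+j*incColC] = alpha*P_[i * NR * div + j];
           }
       }
   } else {
       for (Index j=0; j<NR*div; ++j) {
           for (Index i=0; i<MR; ++i) {
               C[i*incRowC+j*incColC] *= beta;
               C[i*incRowC+j*incColC] += alpha*P_[i * NR * div + j];
           }
       }
   }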

But I don’t get any speedup.  What is mainly relevant for performance is
the loop

   for (Index l=0; l<kc; ++l) {
       for (Index i=0; i<MR; ++i) {
           for (Index j=0; j<(NR); ++j) {
               P[i * NR + j] += A[i]*B_[j];
           }
       }
       A += MR;
       B_ += NR;
   }

From my experience, only once this is optimized to the extreme can a little
bit of extra performance be gained by treating some special cases in
the update

   T *P_ = (T *)P;
   for (Index j=0; j<NR*div; ++j) {
       for (Index i=0; i<MR; ++i) {
           C[i*incRowC+j*incColC] *= beta;
           C[i*incRowC+j*incColC] += alpha*P_[i * NR * div + j];
       }
   }

The main reason my code looks weird is that I also want to use it
for teaching.  The way it is coded, you can easily map it back and
forth to the assembly code.


new kernel:

template <typename Index, typename T>
typename std::enable_if<std::is_floating_point<T>::value,
void>::type
ugemm(Index kc, T alpha,
      const T *A, const T *B,
      T beta,
      T *C, Index incRowC, Index incColC)
{
   A = (const T*) __builtin_assume_aligned (A, 128);
   B = (const T*) __builtin_assume_aligned (B, 128);
   static const unsigned div = 32/sizeof(T);
   static const Index MR = BlockSize<T>::MR;
   static const Index NR = BlockSize<T>::NR/div;

   typedef T vx __attribute__((vector_size (32)));

   vx P[MR*NR] __attribute__ ((aligned (128))) = {};
   const vx *B_ = (vx *)B;
   for (Index l=0; l<kc; ++l) {
       for (Index i=0; i<MR; ++i) {
           for (Index j=0; j<(NR); ++j) {
               P[i * NR + j] += A[i]*B_[j];
           }
       }
       A += MR;
       B_ += NR;
   }

   T *P_ = (T *)P;
   for (Index j=0; j<NR; ++j) {
       for (Index i=0; i<MR; ++i) {
           C[i*incRowC+j*incColC] *= beta;
           C[i*incRowC+j*incColC] += alpha*P_[i * NR + j];
       }
   }
}