Michael,

I've a comment about the function malloc_: it is possible to reduce the memory allocated, exactly 1 byte per call. The change to implement is:

void *
malloc_(std::size_t alignment, std::size_t size)
{
    alignment = std::max(alignment, alignof(void *));
    size     += alignment - 1;  // this is the change: introduce -1

    void *ptr  = std::malloc(size);
    void *ptr2 = (void *)(((uintptr_t)ptr + alignment) & ~(alignment-1));
    void **vp  = (void**) ptr2 - 1;
    *vp        = ptr;
    return ptr2;
}
The reason for this change is that to make sure that an aligned address is allocated it is not needed to add alignment to size, because the allocation of 'size' bytes is the first address.
As a example, let's assume that we want allocate 1 byte of data aligned at 8 (ignore the space to allocate ptr address for simplicity) and the first memory address is 0. Based on the original code, 9 bytes should be allocated (1 + 8), addresses from 0 until 8, so it includes two 8 bytes aligned addresses (addresses 0 and 8). However, based on the change proposed, 8 bytes should be allocated (1 + 8 - 1), addresses from 0 till 7, and only includes one 8 bytes aligned address (address 0).
Joaquim Duran 


2016-02-02 23:26 GMT+01:00 Michael Lehn <michael.lehn@uni-ulm.de>:
I also started to write down some notes on the rational behind the micro-kernel:

http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session6/page01.html


On 02 Feb 2016, at 21:43, Michael Lehn <michael.lehn@uni-ulm.de> wrote:

(So the pages are re-done …)



The bug is because NR is BlockSize<T>::NR divided by the vector length.  So this
should work

//---------------------------------------------------------------------------------
template <typename Index, typename T>
typename std::enable_if<std::is_floating_point<T>::value,
void>::type
ugemm(Index kc, T alpha,
               const T *A, const T *B,
     T beta,
     T *C, Index incRowC, Index incColC)
{
   A = (const T*) __builtin_assume_aligned (A, 128);
   B = (const T*) __builtin_assume_aligned (B, 128);
   static const unsigned div = 32/sizeof(T);
   static const Index MR = BlockSize<T>::MR;
   static const Index NR = BlockSize<T>::NR/div;

   typedef T vx __attribute__((vector_size (32)));

   vx P[MR*NR] __attribute__ ((aligned (128))) = {};
   const vx *B_ = (vx *)B;
   for (Index l=0; l<kc; ++l) {
       for (Index i=0; i<MR; ++i) {
           for (Index j=0; j<(NR); ++j) {
               P[i * NR + j] += A[i]*B_[j];
           }
       }
       A += MR;
       B_ += NR;
   }

   T *P_ = (T *)P;
   for (Index j=0; j<NR*div; ++j) {
       for (Index i=0; i<MR; ++i) {
           C[i*incRowC+j*incColC] *= beta;
           C[i*incRowC+j*incColC] += alpha*P_[i * NR * div + j];
       }
   }
}
//---------------------------------------------------------------------------------

However, as Karl pointed out it also should treat the case beta==0
as a special case.

But I don’t get any speedup.  For the performance relevant is
mainly the loop

   for (Index l=0; l<kc; ++l) {
       for (Index i=0; i<MR; ++i) {
           for (Index j=0; j<(NR); ++j) {
               P[i * NR + j] += A[i]*B_[j];
           }
       }
       A += MR;
       B_ += NR;
   }

From my experience, only if this is optimized to the extrem a little bit
extra performance can be achieved by treating some special cases for
the update

   T *P_ = (T *)P;
   for (Index j=0; j<NR*div; ++j) {
       for (Index i=0; i<MR; ++i) {
           C[i*incRowC+j*incColC] *= beta;
           C[i*incRowC+j*incColC] += alpha*P_[i * NR * div + j];
       }
   }

The main reason my code looks weird is that I also want to use it
for teaching.  The way it is coded you easily can map it forth and
back with the assembly code.


new kernel:

template <typename Index, typename T>
typename std::enable_if<std::is_floating_point<T>::value,
void>::type
ugemm(Index kc, T alpha,
               const T *A, const T *B,
     T beta,
     T *C, Index incRowC, Index incColC)
{
   A = (const T*) __builtin_assume_aligned (A, 128);
   B = (const T*) __builtin_assume_aligned (B, 128);
   static const unsigned div = 32/sizeof(T);
   static const Index MR = BlockSize<T>::MR;
   static const Index NR = BlockSize<T>::NR/div;

   typedef T vx __attribute__((vector_size (32)));

   vx P[MR*NR] __attribute__ ((aligned (128))) = {};
   const vx *B_ = (vx *)B;
   for (Index l=0; l<kc; ++l) {
       for (Index i=0; i<MR; ++i) {
           for (Index j=0; j<(NR); ++j) {
               P[i * NR + j] += A[i]*B_[j];
           }
       }
       A += MR;
       B_ += NR;
   }

   T *P_ = (T *)P;
   for (Index j=0; j<NR; ++j) {
       for (Index i=0; i<MR; ++i) {
           C[i*incRowC+j*incColC] *= beta;
           C[i*incRowC+j*incColC] += alpha*P_[i * NR + j];
       }
   }
}

_______________________________________________
ublas mailing list
ublas@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: michael.lehn@uni-ulm.de


_______________________________________________
ublas mailing list
ublas@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: jdurancomas@gmail.com