void * malloc_(std::size_t alignment, std::size_t size) { alignment = std::max(alignment, alignof(void *)); size += alignment - 1; // this is the change: introduce -1 void *ptr = std::malloc(size); void *ptr2 = (void *)(((uintptr_t)ptr + alignment) & ~(alignment-1)); void **vp = (void**) ptr2 - 1; *vp = ptr; return ptr2; }
The reason for this change is that to make sure that an aligned address is allocated it is not needed to add alignment to size, because the allocation of 'size' bytes is the first address.
As a example, let's assume that we want allocate 1 byte of data aligned at 8 (ignore the space to allocate ptr address for simplicity) and the first memory address is 0. Based on the original code, 9 bytes should be allocated (1 + 8), addresses from 0 until 8, so it includes two 8 bytes aligned addresses (addresses 0 and 8). However, based on the change proposed, 8 bytes should be allocated (1 + 8 - 1), addresses from 0 till 7, and only includes one 8 bytes aligned address (address 0).
Joaquim Duran
I also started to write down some notes on the rational behind the micro-kernel:On 02 Feb 2016, at 21:43, Michael Lehn <michael.lehn@uni-ulm.de> wrote:_______________________________________________(So the pages are re-done …)The bug is because NR is BlockSize<T>::NR divided by the vector length. So thisshould work//---------------------------------------------------------------------------------template <typename Index, typename T>typename std::enable_if<std::is_floating_point<T>::value,void>::typeugemm(Index kc, T alpha,const T *A, const T *B,T beta,T *C, Index incRowC, Index incColC){A = (const T*) __builtin_assume_aligned (A, 128);B = (const T*) __builtin_assume_aligned (B, 128);static const unsigned div = 32/sizeof(T);static const Index MR = BlockSize<T>::MR;static const Index NR = BlockSize<T>::NR/div;typedef T vx __attribute__((vector_size (32)));vx P[MR*NR] __attribute__ ((aligned (128))) = {};const vx *B_ = (vx *)B;for (Index l=0; l<kc; ++l) {for (Index i=0; i<MR; ++i) {for (Index j=0; j<(NR); ++j) {P[i * NR + j] += A[i]*B_[j];}}A += MR;B_ += NR;}T *P_ = (T *)P;for (Index j=0; j<NR*div; ++j) {for (Index i=0; i<MR; ++i) {C[i*incRowC+j*incColC] *= beta;C[i*incRowC+j*incColC] += alpha*P_[i * NR * div + j];}}}//---------------------------------------------------------------------------------However, as Karl pointed out it also should treat the case beta==0as a special case.But I don’t get any speedup. For the performance relevant ismainly the loopfor (Index l=0; l<kc; ++l) {for (Index i=0; i<MR; ++i) {for (Index j=0; j<(NR); ++j) {P[i * NR + j] += A[i]*B_[j];}}A += MR;B_ += NR;}From my experience, only if this is optimized to the extrem a little bitextra performance can be achieved by treating some special cases forthe updateT *P_ = (T *)P;for (Index j=0; j<NR*div; ++j) {for (Index i=0; i<MR; ++i) {C[i*incRowC+j*incColC] *= beta;C[i*incRowC+j*incColC] += alpha*P_[i * NR * div + j];}}The main reason my code looks weird is that I also want to use itfor teaching. The way it is coded you easily can map it forth andback with the assembly code.new kernel:
template <typename Index, typename T>
typename std::enable_if<std::is_floating_point<T>::value,
void>::type
ugemm(Index kc, T alpha,
const T *A, const T *B,
T beta,
T *C, Index incRowC, Index incColC)
{
A = (const T*) __builtin_assume_aligned (A, 128);
B = (const T*) __builtin_assume_aligned (B, 128);
static const unsigned div = 32/sizeof(T);
static const Index MR = BlockSize<T>::MR;
static const Index NR = BlockSize<T>::NR/div;
typedef T vx __attribute__((vector_size (32)));
vx P[MR*NR] __attribute__ ((aligned (128))) = {};
const vx *B_ = (vx *)B;
for (Index l=0; l<kc; ++l) {
for (Index i=0; i<MR; ++i) {
for (Index j=0; j<(NR); ++j) {
P[i * NR + j] += A[i]*B_[j];
}
}
A += MR;
B_ += NR;
}
T *P_ = (T *)P;
for (Index j=0; j<NR; ++j) {
for (Index i=0; i<MR; ++i) {
C[i*incRowC+j*incColC] *= beta;
C[i*incRowC+j*incColC] += alpha*P_[i * NR + j];
}
}
}
ublas mailing list
ublas@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: michael.lehn@uni-ulm.de
_______________________________________________
ublas mailing list
ublas@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: jdurancomas@gmail.com