Subject: Re: [ublas] Matrix multiplication performance
From: palik imre (imre_palik_at_[hidden])
Date: 2016-01-29 17:31:36
Here is a saner kernel using GCC SIMD vectors:
template <typename Index, typename T>
typename std::enable_if<std::is_floating_point<T>::value,
                        void>::type
ugemm(Index kc, T alpha,
      const T *A, const T *B,
      T beta,
      T *C, Index incRowC, Index incColC)
{
    static const Index MR = BlockSize<T>::MR;
    static const Index NR = BlockSize<T>::NR;
    static const unsigned div = 32/sizeof(T);  // elements per 32-byte vector
    typedef T vx __attribute__((vector_size (32)));

    // accumulate the MR x NR micro-block in vector registers
    vx P[MR*NR/div] __attribute__ ((aligned (128))) = {};
    const vx *B_ = (const vx *)B;
    for (Index l=0; l<kc; ++l) {
        for (Index j=0; j<(NR/div); ++j) {
            for (Index i=0; i<MR; ++i) {
                P[i * NR/div + j] += A[i + l * MR] * B_[l*(NR/div) + j];
            }
        }
    }

    // scale C by beta, then add the alpha-scaled accumulator
    T *P_ = (T *)P;
    for (Index i=0; i<MR; ++i) {
        for (Index j=0; j<NR; ++j) {
            C[i*incRowC + j*incColC] *= beta;
            C[i*incRowC + j*incColC] += alpha * P_[i * NR + j];
        }
    }
}
This plugs nicely into Michael's code, and it is slightly faster than my previous attempt (but still nowhere near Michael's hand-crafted kernel).
Good block sizes for double are MR=2, NR=16, KC=512; multiply MR and NR by 2 for float.
Cheers,
Imre