The following is the best gcc SIMD-vector-based micro-kernel I was able to come up with:
template <typename Index>
void
ugemm(Index kc, double alpha,
      const double *A, const double *B,
      double beta,
      double *C, Index incRowC, Index incColC)
{
    const Index MR = BlockSize<double>::MR;
    const Index NR = BlockSize<double>::NR;
    // gcc vector extension: four doubles (32 bytes) per vector
    typedef double v4df __attribute__((vector_size (32)));

    // Accumulator for the MR x NR product, stored row major
    v4df P[BlockSize<double>::MR*BlockSize<double>::NR/4 + 1] __attribute__ ((aligned (128)));
    const v4df *B_ = (const v4df *)B;
    const v4df nv = {0.,0.,0.,0.};

    for (Index l=0; l<MR*NR/4; ++l) {
        P[l] = nv;
    }
    // P += A*B; the scalar A[l + i*kc] is broadcast across the vector
    for (Index i=0; i<MR; ++i) {
        for (Index l=0; l<kc; ++l) {
            for (Index j=0; j<(NR/4); ++j) {
                P[i * NR/4 + j] += A[l + i*kc]*B_[l*(NR/4)+j];
            }
        }
    }
    // C = beta*C + alpha*P
    double *P_ = (double *)P;
    for (Index j=0; j<NR; ++j) {
        for (Index i=0; i<MR; ++i) {
            C[i*incRowC+j*incColC] *= beta;
            C[i*incRowC+j*incColC] += alpha*P_[i * NR + j];
        }
    }
}
Notes about it:
- It is row major, as I find that easier to think about, so it needs a different packing routine.
- It won't compile on gcc 4.6, as that compiler is unwilling to emit vbroadcastsd.
- It assumes that the A & B arrays are properly aligned. (gcc won't emit unaligned vector stores for SIMD arrays.)
- It is really sensitive to block size. On my old AMD box it comes within 90% of Michael's AVX kernel with KC=64, MR=8, & NR=16, while on my AVX2 box it gets within 70% of Michael's FMA kernel with KC=64, MR=8, & NR=32. Part of the reason for the difference is that I cannot persuade gcc to accumulate in registers.
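For illustration, here is a minimal sketch of what a packing routine for this row-major layout could look like. It is not the routine I actually use: the names pack_A_rowmajor and pack_B_rowmajor and the incRow/incCol parameters are made up for this example. The only thing taken from the kernel above is the indexing, A[l + i*kc] for the A panel and l*NR + j for the B panel. The caller is assumed to provide suitably aligned buffers (32 bytes for the B panel, since the kernel reads it as v4df).

```cpp
#include <cassert>

// Hypothetical helper: copy an MR x kc block of A into a buffer so that
// element (i,l) ends up at buffer[l + i*kc], as the kernel reads it.
template <typename Index, typename T>
void pack_A_rowmajor(Index MR, Index kc,
                     const T *A, Index incRowA, Index incColA,
                     T *buffer)
{
    assert(MR >= 0 && kc >= 0);
    for (Index i = 0; i < MR; ++i) {
        for (Index l = 0; l < kc; ++l) {
            buffer[l + i*kc] = A[i*incRowA + l*incColA];
        }
    }
}

// Hypothetical helper: copy a kc x NR block of B into a buffer so that
// element (l,j) ends up at buffer[l*NR + j]. The kernel then views each
// row of NR doubles as NR/4 vectors of four doubles.
template <typename Index, typename T>
void pack_B_rowmajor(Index kc, Index NR,
                     const T *B, Index incRowB, Index incColB,
                     T *buffer)
{
    assert(kc >= 0 && NR >= 0);
    for (Index l = 0; l < kc; ++l) {
        for (Index j = 0; j < NR; ++j) {
            buffer[l*NR + j] = B[l*incRowB + j*incColB];
        }
    }
}
```

With a column-major source matrix one would pass incRow = 1 and incCol = the leading dimension; edge blocks that are smaller than MR or NR would need zero padding, which this sketch leaves out.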
Cheers,
Imre
On Monday, 25 January 2016, 16:10, palik imre <imre_palik@yahoo.co.uk> wrote:
AFAIK there is a boost::simd project. If we really want SIMD classes, we might try to help them get into mainline.
Right now I am trying to catch up with Michael using gcc SIMD vectors. (His code is still 10% faster ...) This should work on gcc, icc, and clang. I think that is general enough for most people's needs.
Anyway, we need a fallback path for non-builtin types, and that could be used for compilers not supporting gcc vectors.
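A rough sketch of what such a fallback could look like, using plain scalar arithmetic so that it works for any element type with +=, *= and construction from 0. The name ugemm_fallback and the choice to pass MR and NR as template parameters are assumptions made for this example only. The panels are assumed to be packed row major, with A indexed as A[l + i*kc] and B as B[l*NR + j].

```cpp
#include <cassert>

// Hypothetical generic fallback micro-kernel: C = beta*C + alpha*A*B
// for an MR x kc panel A and a kc x NR panel B, both packed row major.
// No vector extensions are used, so T may be a non-builtin type.
template <typename Index, Index MR, Index NR, typename T>
void ugemm_fallback(Index kc, T alpha,
                    const T *A, const T *B,
                    T beta,
                    T *C, Index incRowC, Index incColC)
{
    assert(kc >= 0);
    T P[MR*NR];                      // local accumulator, row major
    for (Index l = 0; l < MR*NR; ++l) {
        P[l] = T(0);
    }
    for (Index i = 0; i < MR; ++i) {
        for (Index l = 0; l < kc; ++l) {
            for (Index j = 0; j < NR; ++j) {
                P[i*NR + j] += A[l + i*kc] * B[l*NR + j];
            }
        }
    }
    for (Index j = 0; j < NR; ++j) {
        for (Index i = 0; i < MR; ++i) {
            C[i*incRowC + j*incColC] *= beta;
            C[i*incRowC + j*incColC] += alpha * P[i*NR + j];
        }
    }
}
```

A call would look like ugemm_fallback<int, 4, 4>(kc, alpha, A, B, beta, C, incRowC, incColC), with T deduced from the arguments.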
Cheers,
Imre
On Sunday, 24 January 2016, 1:36, Joaquim Duran Comas <jdurancomas@gmail.com> wrote:
Hello,
Great job.
The AVX micro-kernel has been implemented in assembler. Note that MSVC, clang and g++ expose SSE*, AVX, NEON and other SIMD instruction sets to the C language, so it should be possible to rewrite the asm code in C.
Also, basic SIMD classes could be created (SIMD<char>, SIMD<float>, ...) that call the proper functions to implement the operations.
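To make the idea concrete, here is a minimal sketch of such a class. Everything in it is hypothetical: the SIMD template, its load/store members, and the 16-byte vector width are chosen for illustration. It also uses the gcc/clang vector_size extension instead of per-platform intrinsics, so it would not build with MSVC as written; a real version would specialize per instruction set.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical wrapper: the primary template is only declared, and each
// element type gets its own specialization.
template <typename T> struct SIMD;

template <>
struct SIMD<float> {
    typedef float vec __attribute__((vector_size(16)));  // 4 floats
    enum { size = 4 };
    vec v;

    // memcpy keeps load/store valid for unaligned pointers
    static SIMD load(const float *p) { SIMD r; std::memcpy(&r.v, p, sizeof(vec)); return r; }
    void store(float *p) const { std::memcpy(p, &v, sizeof(vec)); }
    SIMD operator+(const SIMD &o) const { SIMD r; r.v = v + o.v; return r; }
    SIMD operator*(const SIMD &o) const { SIMD r; r.v = v * o.v; return r; }
};

template <>
struct SIMD<double> {
    typedef double vec __attribute__((vector_size(16)));  // 2 doubles
    enum { size = 2 };
    vec v;

    static SIMD load(const double *p) { SIMD r; std::memcpy(&r.v, p, sizeof(vec)); return r; }
    void store(double *p) const { std::memcpy(p, &v, sizeof(vec)); }
    SIMD operator+(const SIMD &o) const { SIMD r; r.v = v + o.v; return r; }
    SIMD operator*(const SIMD &o) const { SIMD r; r.v = v * o.v; return r; }
};
```

Generic code could then be written once against SIMD<T>, for example (SIMD<float>::load(a) * SIMD<float>::load(b)).store(c), and the specialization picks the element-wise operations.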
Joaquim Duran