Subject: Re: [ublas] Matrix multiplication performance
From: palik imre (imre_palik_at_[hidden])
Date: 2016-01-27 10:01:48

Next message: nasos: "Re: [ublas] Matrix multiplication performance"
Previous message: Michael Lehn: "Re: [ublas] Matrix multiplication performance"
In reply to: palik imre: "Re: [ublas] Matrix multiplication performance"
Next in thread: Michael Lehn: "Re: [ublas] Matrix multiplication performance"
Reply: Michael Lehn: "Re: [ublas] Matrix multiplication performance"

The following is the best micro-kernel based on gcc SIMD arrays that I was able to come up with:
template <typename Index>
void
ugemm(Index kc, double alpha,
      const double *A, const double *B,
      double beta,
      double *C, Index incRowC, Index incColC)
{
   const Index MR = BlockSize<double>::MR;
   const Index NR = BlockSize<double>::NR;

   // gcc vector type holding four doubles (256 bits)
   typedef double v4df __attribute__((vector_size (32)));

   // accumulator for the MR x NR product, stored row-major
   v4df P[BlockSize<double>::MR*BlockSize<double>::NR/4 + 1] __attribute__ ((aligned (128)));
   const v4df *B_ = (const v4df *)B;
   const v4df nv = {0.,0.,0.,0.};
   for (Index l=0; l<MR*NR/4; ++l) {
     P[l] = nv;
   }
   // P(i,:) += A(i,l) * B(l,:); the scalar from A is broadcast over the vector
   for (Index i=0; i<MR; ++i) {
     for (Index l=0; l<kc; ++l) {
       for (Index j=0; j<(NR/4); ++j) {
         P[i * NR/4 + j] += A[l + i*kc]*B_[l*(NR/4)+j];
       }
     }
   }
   // C = beta*C + alpha*P
   double *P_ = (double *)P;
   for (Index j=0; j<NR; ++j) {
       for (Index i=0; i<MR; ++i) {
           C[i*incRowC+j*incColC] *= beta;
           C[i*incRowC+j*incColC] += alpha*P_[i * NR + j];
       }
   }
}
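
The indexing in the kernel above (A[l + i*kc] and B_[l*(NR/4)+j]) implies that the packed A panel holds MR rows of kc contiguous values, and the packed B panel holds kc rows of NR contiguous values. A minimal sketch of packing routines for a single micro-panel with that layout follows. The names pack_A_panel and pack_B_panel and their signatures are hypothetical, not taken from the code discussed in this thread, and the buffer for B would additionally need to be 32-byte aligned.

```cpp
// Hypothetical helper (not from the original post): copy an MR x kc block
// of A into 'packed' so that row i occupies packed[i*kc .. i*kc + kc).
// This matches the kernel's access pattern A[l + i*kc].
template <typename Index>
void pack_A_panel(Index MR, Index kc,
                  const double *A, Index incRowA, Index incColA,
                  double *packed)
{
    for (Index i = 0; i < MR; ++i) {
        for (Index l = 0; l < kc; ++l) {
            packed[l + i*kc] = A[i*incRowA + l*incColA];
        }
    }
}

// Hypothetical helper: copy a kc x NR block of B into 'packed' so that
// row l occupies packed[l*NR .. l*NR + NR). This matches the kernel's
// access pattern B_[l*(NR/4) + j] once viewed as vectors of four doubles.
template <typename Index>
void pack_B_panel(Index kc, Index NR,
                  const double *B, Index incRowB, Index incColB,
                  double *packed)
{
    for (Index l = 0; l < kc; ++l) {
        for (Index j = 0; j < NR; ++j) {
            packed[l*NR + j] = B[l*incRowB + j*incColB];
        }
    }
}
```

Edge handling (blocks smaller than MR or NR, which would need zero padding) is left out of this sketch.
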
Notes about the kernel:

- It is row-major, as I find that easier to think about, so it needs a different packing routine.

- It won't compile on gcc 4.6, as that compiler is unwilling to emit vbroadcastsd.
- It assumes that the A & B arrays are properly aligned. (gcc won't emit unaligned vector stores for SIMD arrays.)
- It is really sensitive to block size. On my old AMD box it comes within 90% of Michael's AVX kernel with KC=64, MR=8, & NR=16, while on my AVX2 box it gets within 70% of Michael's FMA kernel with KC=64, MR=8, & NR=32. Part of the reason for the difference is that I cannot persuade gcc to accumulate in registers.
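
Regarding the last note, one variation that might be worth trying is to make MR and NR compile-time constants and to accumulate one row of the product at a time in a small local array, so the compiler sees a fixed, small set of accumulators. The sketch below is an untested idea in terms of performance: whether gcc actually keeps acc in registers depends on the compiler version and flags, and no measurements back it. The name ugemm_local_acc, the template parameters, the may_alias attribute, and the explicit broadcast are assumptions added for this illustration. It assumes B is 32-byte aligned and NR is a multiple of 4.

```cpp
// may_alias makes reading double storage through this type well-defined
// under strict aliasing (the same attribute is used for __m256d).
typedef double v4df __attribute__((vector_size(32), may_alias));

// Hypothetical variant of the kernel above with compile-time MR and NR.
// A is packed as MR rows of kc values, B as kc rows of NR values.
template <int MR, int NR, typename Index>
void ugemm_local_acc(Index kc, double alpha,
                     const double *A, const double *B,
                     double beta,
                     double *C, Index incRowC, Index incColC)
{
    const int NV = NR / 4;                  // vectors per row of the product
    const v4df *B_ = reinterpret_cast<const v4df *>(B);
    const v4df zero = {0., 0., 0., 0.};

    for (int i = 0; i < MR; ++i) {
        v4df acc[NV];                       // accumulators for row i only
        for (int j = 0; j < NV; ++j) {
            acc[j] = zero;
        }
        for (Index l = 0; l < kc; ++l) {
            const double a = A[l + i*kc];
            const v4df av = {a, a, a, a};   // explicit broadcast of A(i,l)
            for (int j = 0; j < NV; ++j) {
                acc[j] += av * B_[l*NV + j];
            }
        }
        // C(i,:) = beta*C(i,:) + alpha*acc
        for (int j = 0; j < NV; ++j) {
            for (int k = 0; k < 4; ++k) {
                double &c = C[i*incRowC + (4*j + k)*incColC];
                c = beta*c + alpha*acc[j][k];
            }
        }
    }
}
```

With NR=16 this keeps four vector accumulators live per row, which fits in the sixteen AVX registers; inspecting the generated assembly is the only way to confirm what the compiler does with them.
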
Cheers,
Imre

On Monday, 25 January 2016, 16:10, palik imre <imre_palik_at_[hidden]> wrote:

AFAIK there is a boost::simd project. If we really want SIMD classes, we might try to help them to get to mainline.

Right now I am trying to catch up with Michael using gcc SIMD vectors. (His code is still 10% faster ...) This should work on gcc, icc, and clang. I think that is general enough for most people's needs.
Anyway, we need a fallback path for non-builtin types, and that could be used for compilers not supporting gcc vectors.
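
A minimal sketch of what such a fallback could look like: a plain scalar micro-kernel templated on the element type, using the same row-major packed layout and the same update C = beta*C + alpha*A*B as the SIMD micro-kernel at the top of this message. The name ugemm_generic and its signature are hypothetical. It only assumes that T supports +, * and construction from 0.

```cpp
// Hypothetical generic fallback (not from the original thread).
// A is packed as MR rows of kc values, B as kc rows of NR values.
template <typename T, typename Index>
void ugemm_generic(Index MR, Index NR, Index kc, const T &alpha,
                   const T *A, const T *B, const T &beta,
                   T *C, Index incRowC, Index incColC)
{
    for (Index i = 0; i < MR; ++i) {
        for (Index j = 0; j < NR; ++j) {
            T p = T(0);                       // dot product of row i and column j
            for (Index l = 0; l < kc; ++l) {
                p = p + A[l + i*kc] * B[l*NR + j];
            }
            T &c = C[i*incRowC + j*incColC];
            c = beta*c + alpha*p;
        }
    }
}
```

Because it uses no vector types, the same function would also serve compilers that do not support gcc vectors.
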
Cheers,
Imre

On Sunday, 24 January 2016, 1:36, Joaquim Duran Comas <jdurancomas_at_[hidden]> wrote:

Hello,
It has been a great job.
The AVX micro-kernel has been implemented in assembler. Note that MSVC, clang and g++ expose SSE*, AVX, NEON and other SIMD instruction sets to C through intrinsics, so it should be possible to rewrite the asm code in C:
http://stackoverflow.com/questions/11228855/header-files-for-simd-intrinsics
https://www.cs.uaf.edu/2009/fall/cs301/lecture/11_13_sse_intrinsics.html

Also, basic SIMD classes could be created (SIMD<char>, SIMD<float>, ...) that call the proper functions to implement the operations.
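
A rough sketch of such a class, assuming gcc/clang vector extensions rather than per-platform intrinsics. The SIMD template, its members (width, splat, load, store) and the axpy example are all hypothetical names invented for this illustration; this is not an existing Boost or uBLAS interface. It uses 16-byte vectors and memcpy-based loads and stores so that it makes no alignment assumptions.

```cpp
#include <cstring>

template <typename T> struct SIMD;   // primary template intentionally undefined

template <> struct SIMD<float> {
    typedef float vec __attribute__((vector_size(16)));
    static const int width = 4;      // lanes per vector
    vec v;
    static SIMD splat(float x)       { vec t = {x, x, x, x}; SIMD r = {t}; return r; }
    static SIMD load(const float *p) { SIMD r; std::memcpy(&r.v, p, sizeof r.v); return r; }
    void store(float *p) const       { std::memcpy(p, &v, sizeof v); }
};

template <> struct SIMD<double> {
    typedef double vec __attribute__((vector_size(16)));
    static const int width = 2;
    vec v;
    static SIMD splat(double x)       { vec t = {x, x}; SIMD r = {t}; return r; }
    static SIMD load(const double *p) { SIMD r; std::memcpy(&r.v, p, sizeof r.v); return r; }
    void store(double *p) const       { std::memcpy(p, &v, sizeof v); }
};

// Element-wise operators shared by all specializations.
template <typename T>
SIMD<T> operator+(SIMD<T> a, SIMD<T> b) { SIMD<T> r = {a.v + b.v}; return r; }

template <typename T>
SIMD<T> operator*(SIMD<T> a, SIMD<T> b) { SIMD<T> r = {a.v * b.v}; return r; }

// Example use: y = a*x + y, vectorized in chunks of SIMD<T>::width
// with a scalar loop for the remaining elements.
template <typename T>
void axpy(int n, T a, const T *x, T *y)
{
    const int w = SIMD<T>::width;
    const SIMD<T> av = SIMD<T>::splat(a);
    int i = 0;
    for (; i + w <= n; i += w) {
        (av * SIMD<T>::load(x + i) + SIMD<T>::load(y + i)).store(y + i);
    }
    for (; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

Code written against such a wrapper stays the same for every element type; only the specializations would change per instruction set.
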
Joaquim Duran

_______________________________________________
ublas mailing list
ublas_at_[hidden]
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: imre_palik_at_[hidden]
