The following is the best micro-kernel based on gcc SIMD vectors that I was able to come up with:

template <typename Index>
void
ugemm(Index kc, double alpha,
      const double *A, const double *B,
      double beta,
      double *C, Index incRowC, Index incColC)
{
    const Index MR = BlockSize<double>::MR;
    const Index NR = BlockSize<double>::NR;

    // gcc vector type holding 4 doubles (one 256-bit register)
    typedef double v4df __attribute__((vector_size (32)));

    // Accumulator for the MR x NR result block, stored row major
    v4df P[BlockSize<double>::MR*BlockSize<double>::NR/4 + 1] __attribute__ ((aligned (128)));
    // B is read as vectors, so it has to be 32-byte aligned
    const v4df *B_ = (const v4df *)B;
    const v4df nv = {0.,0.,0.,0.};
    for (Index l=0; l<MR*NR/4; ++l) {
        P[l] = nv;
    }
    // P += A*B; the scalar A[l + i*kc] is broadcast over the 4 lanes
    for (Index i=0; i<MR; ++i) {
        for (Index l=0; l<kc; ++l) {
            for (Index j=0; j<(NR/4); ++j) {
                P[i*(NR/4) + j] += A[l + i*kc]*B_[l*(NR/4)+j];
            }
        }
    }
    // C = beta*C + alpha*P
    double *P_ = (double *)P;
    for (Index j=0; j<NR; ++j) {
        for (Index i=0; i<MR; ++i) {
            C[i*incRowC+j*incColC] *= beta;
            C[i*incRowC+j*incColC] += alpha*P_[i * NR + j];
        }
    }
}

Notes about it

- It is row major, as I find it easier to think that way. So it needs a different packing routine.

- It won't compile on gcc 4.6, as that compiler is unwilling to do vbroadcastsd.

- It assumes that the A & B arrays are properly aligned. (gcc won't emit unaligned vector stores for simd arrays)

- It is really sensitive to block size. On my old AMD box it comes within 90% of Michael's AVX kernel with KC=64, MR=8, and NR=16, while on my AVX2 box it gets within 70% of Michael's FMA kernel with KC=64, MR=8, and NR=32. Part of the reason for the difference is that I cannot persuade gcc to accumulate in registers.
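
The row-major layout mentioned in the first note could be produced by packing routines along the following lines. This is only a sketch: the names pack_A_rowmajor and pack_B_rowmajor and their signatures are hypothetical and not part of the existing code. Zero-padding of partial edge blocks and aligned allocation of the buffers are left out.

```cpp
#include <cstddef>

// Hypothetical helper: copy an mr x kc block of A into a buffer laid out
// the way the kernel above reads it, i.e. element (i,l) at buffer[l + i*kc].
template <typename Index, typename T>
void pack_A_rowmajor(Index mr, Index kc,
                     const T *A, Index incRowA, Index incColA,
                     T *buffer)
{
    for (Index i = 0; i < mr; ++i) {
        for (Index l = 0; l < kc; ++l) {
            buffer[l + i*kc] = A[i*incRowA + l*incColA];
        }
    }
}

// Hypothetical helper: copy a kc x nr block of B so that element (l,j)
// ends up at buffer[l*nr + j], matching the B_[l*(NR/4)+j] vector reads.
template <typename Index, typename T>
void pack_B_rowmajor(Index kc, Index nr,
                     const T *B, Index incRowB, Index incColB,
                     T *buffer)
{
    for (Index l = 0; l < kc; ++l) {
        for (Index j = 0; j < nr; ++j) {
            buffer[l*nr + j] = B[l*incRowB + j*incColB];
        }
    }
}
```

For a column-major source matrix with leading dimension ld, the call would use incRow = 1 and incCol = ld. The buffer passed for B would additionally have to be 32-byte aligned for the kernel above.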

Cheers,

Imre

On Monday, 25 January 2016, 16:10, palik imre <imre_palik@yahoo.co.uk> wrote:

AFAIK there is a boost::simd project. If we really want SIMD classes, we might try to help them to get to mainline.

Right now I am trying to catch up with Michael using gcc SIMD vectors. (His code is still 10% faster ...) This should work on gcc, icc, and clang. I think that is general enough for most people's needs.
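
For readers who have not used gcc SIMD vectors, here is a minimal sketch of the vector_size extension. The function name axpy4 is made up for illustration. The scalar is broadcast explicitly through an initializer list, and memcpy is used for loads and stores so that the example does not depend on the alignment of the pointers.

```cpp
#include <cstring>

// Vector of 4 doubles; the compiler maps it to whatever SIMD registers
// the target offers (or to scalar code if there are none).
typedef double v4df __attribute__((vector_size(32)));

// Illustrative helper (hypothetical name): y[0..3] += a * x[0..3]
inline void axpy4(double a, const double *x, double *y)
{
    v4df vx, vy;
    std::memcpy(&vx, x, sizeof vx);      // load without alignment assumptions
    std::memcpy(&vy, y, sizeof vy);
    const v4df va = {a, a, a, a};        // explicit broadcast of the scalar
    vy += va * vx;                       // element-wise multiply and add
    std::memcpy(y, &vy, sizeof vy);      // store the result back
}
```

Arithmetic operators on such vector types work element-wise, which is all the kernel needs.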

Anyway, we need a fallback path for non-builtin types, and that could be used for compilers not supporting gcc vectors.
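
Such a fallback path could look roughly like the following scalar kernel. It is a sketch under assumptions, not existing code: the name ugemm_generic is hypothetical, MR and NR are passed as template parameters instead of coming from a BlockSize trait, the packed panels are assumed to be row major (A at A[l + i*kc], B at B[l*NR + j]), and T is assumed to be constructible from 0 and to support +=, * and +.

```cpp
#include <cstddef>

// Hypothetical generic micro-kernel: C = beta*C + alpha*A*B for one
// MR x NR block, using only scalar operations on T.
template <std::size_t MR, std::size_t NR, typename T, typename Index>
void ugemm_generic(Index kc, T alpha,
                   const T *A, const T *B,
                   T beta,
                   T *C, Index incRowC, Index incColC)
{
    const Index mr = Index(MR);
    const Index nr = Index(NR);

    // Local accumulator for the result block, stored row major
    T P[MR*NR];
    for (Index k = 0; k < mr*nr; ++k) {
        P[k] = T(0);
    }
    // P += A*B as a sequence of kc rank-1 updates
    for (Index l = 0; l < kc; ++l) {
        for (Index i = 0; i < mr; ++i) {
            for (Index j = 0; j < nr; ++j) {
                P[i*nr + j] += A[l + i*kc] * B[l*nr + j];
            }
        }
    }
    // C = beta*C + alpha*P
    for (Index j = 0; j < nr; ++j) {
        for (Index i = 0; i < mr; ++i) {
            C[i*incRowC + j*incColC] = beta*C[i*incRowC + j*incColC]
                                     + alpha*P[i*nr + j];
        }
    }
}
```

A dispatcher could then select the vector kernel for double and fall back to this one for every other element type or for compilers without gcc vectors.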

Cheers,

Imre

On Sunday, 24 January 2016, 1:36, Joaquim Duran Comas <jdurancomas@gmail.com> wrote:

Hello,

It has been a great job.

The AVX micro-kernel has been implemented in assembler. Note that MSVC, clang and g++ expose SSE*, AVX, NEON and other SIMD instruction sets to C through intrinsics. So it should be possible to rewrite the asm code in C.

http://stackoverflow.com/questions/11228855/header-files-for-simd-intrinsics

https://www.cs.uaf.edu/2009/fall/cs301/lecture/11_13_sse_intrinsics.html

Also, basic SIMD classes could be created (SIMD<char>, SIMD<float>, ...) that call the proper functions to implement the operations.
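
As a rough illustration of that idea, a SIMD<T> wrapper might be sketched as below. Everything here is an assumption made for the example: the trait simd_traits, the member names, and the fixed 16-byte width are invented. To keep the sketch portable it uses gcc/clang vector types rather than the vendor intrinsics mentioned above; an intrinsics-based version would specialize the operations per instruction set instead.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical trait mapping an element type to a 16-byte vector type.
template <typename T> struct simd_traits;
template <> struct simd_traits<float> {
    typedef float vec __attribute__((vector_size(16)));
};
template <> struct simd_traits<double> {
    typedef double vec __attribute__((vector_size(16)));
};

// Hypothetical wrapper class offering element-wise arithmetic.
template <typename T>
class SIMD {
public:
    typedef typename simd_traits<T>::vec vec;
    enum { size = sizeof(vec) / sizeof(T) };   // number of lanes

    // Load 'size' elements; memcpy avoids alignment requirements.
    static SIMD load(const T *p) {
        SIMD r;
        std::memcpy(&r.v, p, sizeof r.v);
        return r;
    }
    // Store all lanes to memory.
    void store(T *p) const { std::memcpy(p, &v, sizeof v); }

    friend SIMD operator+(SIMD a, const SIMD &b) { a.v += b.v; return a; }
    friend SIMD operator*(SIMD a, const SIMD &b) { a.v *= b.v; return a; }

private:
    vec v;
};
```

With such a class, a kernel could be written once against SIMD<T> and instantiated for each supported element type.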

Joaquim Duran
