Ublas :

Subject: Re: [ublas] Matrix multiplication performance
From: palik imre (imre_palik_at_[hidden])
Date: 2016-01-27 10:01:48

The following is the best gcc SIMD-array-based micro-kernel I was able to come up with:
template <typename Index>
void
ugemm(Index kc, double alpha,
      const double *A, const double *B,
      double beta,
      double *C, Index incRowC, Index incColC)
{
    const Index MR = BlockSize<double>::MR;
    const Index NR = BlockSize<double>::NR;

    typedef double v4df __attribute__((vector_size (32)));

    // Accumulator for the MR x NR result block, stored row-major as vectors.
    v4df P[BlockSize<double>::MR*BlockSize<double>::NR/4 + 1] __attribute__ ((aligned (128)));
    const v4df *B_ = (const v4df *)B;
    const v4df nv = {0.,0.,0.,0.};
    for (Index l=0; l<MR*NR/4; ++l) {
      P[l] = nv;
    }
    for (Index i=0; i<MR; ++i) {
      for (Index l=0; l<kc; ++l) {
        for (Index j=0; j<(NR/4); ++j) {
          // scalar * vector: A[l + i*kc] is broadcast to all four lanes
          P[i * NR/4 + j] += A[l + i*kc]*B_[l*(NR/4)+j];
        }
      }
    }
    // C <- beta*C + alpha*P
    double *P_ = (double *)P;
    for (Index j=0; j<NR; ++j) {
        for (Index i=0; i<MR; ++i) {
            C[i*incRowC+j*incColC] *= beta;
            C[i*incRowC+j*incColC] += alpha*P_[i * NR + j];
        }
    }
}
Notes about it:

- It is row major, as I find that easier to reason about.  So it needs a different packing routine.
- It won't compile on gcc 4.6, as that compiler is unwilling to do vbroadcastsd.
- It assumes that the A and B arrays are properly aligned. (gcc won't emit unaligned vector stores for SIMD arrays.)
- It is really sensitive to block size.  On my old AMD box it comes within 90% of Michael's AVX kernel with KC=64, MR=8, and NR=16, while on my AVX2 box it gets within 70% of Michael's FMA kernel with KC=64, MR=8, and NR=32.  Part of the reason for the difference is that I cannot persuade gcc to accumulate in registers.
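
The row-major layout from the first note can be read off the kernel's indexing: element (i, l) of the A panel sits at A[l + i*kc], and element (l, j) of the B panel at B[l*NR + j]. A minimal packing sketch under that reading is given below. The names pack_A_rowmajor and pack_B_rowmajor and their stride parameters are hypothetical and not taken from the code in this thread. The 32-byte alignment check on the B buffer is an assumption that mirrors the alignment note, since the kernel reads B as 32-byte vectors.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: copy an MR x kc panel of A so that element (i, l)
// lands at buffer[l + i*kc], which is how the kernel indexes A.
template <typename Index>
void pack_A_rowmajor(Index MR, Index kc,
                     const double *A, Index incRowA, Index incColA,
                     double *buffer)
{
    for (Index i = 0; i < MR; ++i) {
        for (Index l = 0; l < kc; ++l) {
            buffer[l + i*kc] = A[i*incRowA + l*incColA];
        }
    }
}

// Hypothetical helper: copy a kc x NR panel of B so that element (l, j)
// lands at buffer[l*NR + j], which is how the kernel indexes B.
template <typename Index>
void pack_B_rowmajor(Index kc, Index NR,
                     const double *B, Index incRowB, Index incColB,
                     double *buffer)
{
    // Assumption: the kernel loads this buffer as 32-byte vectors.
    assert(reinterpret_cast<std::uintptr_t>(buffer) % 32 == 0);
    for (Index l = 0; l < kc; ++l) {
        for (Index j = 0; j < NR; ++j) {
            buffer[l*NR + j] = B[l*incRowB + j*incColB];
        }
    }
}
```

Edge panels (fewer than MR rows or NR columns) would need zero padding, which this sketch leaves out.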


    On Monday, 25 January 2016, 16:10, palik imre <imre_palik_at_[hidden]> wrote:

 AFAIK there is a boost::simd project.  If we really want SIMD classes, we might try to help them to get to mainline.

Right now I am trying to catch up with Michael using gcc SIMD vectors. (His code is still 10% faster ...)  This should work on gcc, icc, and clang.  I think that is general enough for most people's needs.
Anyway, we need a fallback path for non-builtin types, and that could be used for compilers not supporting gcc vectors.
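
Such a fallback path could be as simple as a scalar version of the micro-kernel. A minimal sketch follows, assuming a packed layout with A(i, l) at A[l + i*kc] and B(l, j) at B[l*NR + j]. The name ugemm_generic and the compile-time MR/NR parameters are hypothetical, and T is assumed to be constructible from 0 and to support +=, * and +.

```cpp
#include <cassert>

// Hypothetical scalar fallback: C <- beta*C + alpha*A*B for one MR x NR block.
// Works for any T with T(0), +=, * and +; no vector extensions required.
template <int MR, int NR, typename T, typename Index>
void ugemm_generic(Index kc, T alpha,
                   const T *A, const T *B,
                   T beta,
                   T *C, Index incRowC, Index incColC)
{
    assert(kc >= 0);

    // Local accumulator for the MR x NR block, stored row-major.
    T P[MR*NR];
    for (int l = 0; l < MR*NR; ++l) {
        P[l] = T(0);
    }
    for (Index l = 0; l < kc; ++l) {
        for (int i = 0; i < MR; ++i) {
            for (int j = 0; j < NR; ++j) {
                P[i*NR + j] += A[l + i*kc] * B[l*NR + j];
            }
        }
    }
    for (int j = 0; j < NR; ++j) {
        for (int i = 0; i < MR; ++i) {
            C[i*incRowC + j*incColC] = beta * C[i*incRowC + j*incColC]
                                     + alpha * P[i*NR + j];
        }
    }
}
```

A call such as ugemm_generic<2, 2>(kc, alpha, A, B, beta, C, incRowC, incColC) can also serve as a reference when checking a vectorised kernel.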

    On Sunday, 24 January 2016, 1:36, Joaquim Duran Comas <jdurancomas_at_[hidden]> wrote:

It has been a great job.
The AVX micro-kernel has been implemented in assembler. Note that MSVC, clang and g++ expose SSE*, AVX, NEON and other SIMD instruction sets to the C language, so it should be possible to rewrite the asm code in C.

Also, basic SIMD classes could be created (SIMD<char>, SIMD<float>, ...) to call the proper functions that implement the operations.
Joaquim Duran
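
A rough sketch of what such SIMD classes might look like is given below. It is built on gcc/clang vector extensions rather than on per-platform intrinsics, so it does not cover MSVC. All names (SimdVec, SIMD, load, store, broadcast, axpy) are hypothetical, and the 16-byte vector width is an assumption chosen for portability, not a recommendation.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical trait mapping an element type to a 16-byte vector type.
template <typename T> struct SimdVec;
template <> struct SimdVec<float>  { typedef float  type __attribute__((vector_size(16))); };
template <> struct SimdVec<double> { typedef double type __attribute__((vector_size(16))); };

// Hypothetical wrapper class in the spirit of SIMD<float>, SIMD<double>, ...
template <typename T>
struct SIMD {
    typedef typename SimdVec<T>::type vec;
    enum { width = sizeof(vec) / sizeof(T) };
    vec v;

    // memcpy keeps load/store valid for unaligned pointers.
    static SIMD load(const T *p) { SIMD r = SIMD(); std::memcpy(&r.v, p, sizeof(vec)); return r; }
    void store(T *p) const { std::memcpy(p, &v, sizeof(vec)); }
    static SIMD broadcast(T x) {
        SIMD r = SIMD();
        for (int i = 0; i < width; ++i) r.v[i] = x;
        return r;
    }
    friend SIMD operator+(SIMD a, SIMD b) { a.v = a.v + b.v; return a; }
    friend SIMD operator*(SIMD a, SIMD b) { a.v = a.v * b.v; return a; }
};

// Usage example: y[i] += a * x[i], for n a multiple of the vector width.
template <typename T>
void axpy(int n, T a, const T *x, T *y)
{
    assert(n % SIMD<T>::width == 0);
    const SIMD<T> va = SIMD<T>::broadcast(a);
    for (int i = 0; i < n; i += SIMD<T>::width) {
        (va * SIMD<T>::load(x + i) + SIMD<T>::load(y + i)).store(y + i);
    }
}
```

Generic code such as axpy is written once against SIMD<T>, and the per-type details stay in the SimdVec specialisations.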
