
Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-24 09:47:27
On 24 Jan 2016, at 13:11, Karl Meerbergen <karl.meerbergen_at_[hidden]> wrote:
>
>> Anyway, I think what C++ would need is its own "BLAS C++ Standard Interface" for low-level functionality. This
>> can be supported by simple reference implementations or sophisticated high-performance implementations. But
>> we would have a common ground on which one can build high-level C++ libraries.
>
> I agree, but there is no std type/concept for a matrix. At this stage, it is impossible for developers of linear algebra to write libraries in C++ since their users have to use the underlying library for their applications. With the bindings, we have tried to suggest such a common ground. I once (many years ago) started to build a library from the bindings, but gave up because of time constraints …
>
I completely understand the problem with the time constraints. But I hope that once there is a common ground
we can take advantage of the fact that different people have different constraints. For myself, the primary
constraint for producing code is the possibility to use it for teaching. Otherwise, sooner or later, I don't really
have time to maintain it. As Imre pointed out, my code for the matrix product is "C++ code in C style". That has the
following background. One of my lectures is an undergraduate class, "Introduction to HPC". Students attending this
class have different majors (e.g. math, physics, engineering) and therefore different backgrounds. Some take the
class before they have had numerical linear algebra, others have never coded in C/C++. So the class really has to
start from zero: basics of hardware architectures, an introduction to assembly (Intel 64), an introduction to C, ... The
goal of the class is a blocked implementation of the LU factorization (the same algorithm as DGETRF). That requires
some BLAS Level 1 and 2 functions plus optimized DGEMM and DTRSM functions. The GEMM micro-kernels are optimized
for the machines the students actually use. This may not sound like much, but the students have to code each line
themselves. That is only possible if things are restricted to a "primitive" level. After all, it is not a general class on
computer science but on HPC. So having, at the end of the semester, an LU factorization that can compete with MKL's
DGETRF is (in this case) more important than a general implementation.
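To make the course goal concrete: a blocked DGETRF-style routine is built on top of an unblocked variant. The following is my own minimal sketch of an unblocked LU factorization with partial pivoting (illustrative only, not the actual course code; the row-major layout and names are my choice):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>

// Unblocked LU factorization with partial pivoting (DGETF2-style sketch).
// A is m x n, stored row-major with leading dimension lda; p[j] receives
// the pivot row chosen at step j.  Purely illustrative.
void lu_unblocked(std::size_t m, std::size_t n, double *A, std::size_t lda,
                  std::size_t *p)
{
    std::size_t mn = (m < n) ? m : n;
    for (std::size_t j = 0; j < mn; ++j) {
        // Find the row with the largest absolute value in column j.
        std::size_t piv = j;
        for (std::size_t i = j + 1; i < m; ++i) {
            if (std::fabs(A[i*lda + j]) > std::fabs(A[piv*lda + j])) {
                piv = i;
            }
        }
        p[j] = piv;
        if (piv != j) {                          // swap rows j and piv
            for (std::size_t k = 0; k < n; ++k) {
                std::swap(A[j*lda + k], A[piv*lda + k]);
            }
        }
        // Scale the subdiagonal of column j and apply the rank-1 update
        // to the trailing submatrix.
        for (std::size_t i = j + 1; i < m; ++i) {
            A[i*lda + j] /= A[j*lda + j];
            for (std::size_t k = j + 1; k < n; ++k) {
                A[i*lda + k] -= A[i*lda + j] * A[j*lda + k];
            }
        }
    }
}
```

The blocked version the class works towards replaces the inner updates by calls to DTRSM and DGEMM on whole panels, which is where the optimized GEMM pays off.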
So long story short. My job allows that I can provide examples for special cases. In particular I can provide
selfcontained proof of concept examples.
This semester we also offer a master-level class on HPC. Here we also give an introduction to C++. But again, I don't
want students to use a ready-to-use library. All the pieces, like the matrix class, have to be coded by the students
themselves. IMHO, if students have coded their simple-but-own matrix library, then in the long run they are better
prepared to use standard libraries. The same goes for OpenMP: before they "just use" OpenMP, one project is writing
their own thread pool so that they see how OpenMP works internally.
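As an illustration of the scope of such a thread-pool project (this is my own minimal sketch, not the students' code), a fixed-size pool in standard C++ might look like:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal fixed-size thread pool: workers pull tasks from a shared queue
// until the pool is shut down.  The destructor drains the queue and joins.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx_);
                        cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                        if (done_ && tasks_.empty()) return;
                        task = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    task();                       // run outside the lock
                }
            });
        }
    }

    void enqueue(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto &w : workers_) w.join();
    }

private:
    std::vector<std::thread>          workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex                        mtx_;
    std::condition_variable           cv_;
    bool                              done_ = false;
};
```

Once students have debugged the condition-variable handshake themselves, the pragmas `#pragma omp parallel` and `#pragma omp task` stop being magic.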
Again, long story short: even in the future we will just use "vim + gcc + g++". But I hope that students are later
able to use and contribute to "mainstream C++ libraries" in this domain.
************************************************************************************************************************************
One more for proof-of-concept examples: for BLAS I have a *complete* header-only C++ implementation. For
near-peak performance it "only" requires an optimized ugemm for float, double, complex<float> and complex<double>.
************************************************************************************************************************************
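To make the ugemm idea concrete: in such a design only the micro-kernel, which updates a small MR x NR block of C from packed panels of A and B, is architecture-specific, and a generic fallback works for every element type. The following is my sketch of what such a portable reference micro-kernel could look like; the parameter list is an assumption for illustration, not the actual ulmBLAS ugemm interface:

```cpp
#include <cstddef>

// Generic reference micro-kernel: C = beta*C + alpha*(A*B) on an MR x NR
// block of C.  A is a packed MR x k panel (column-major MR-columns),
// B is a packed k x NR panel (row-major NR-rows), C is row-major with
// leading dimension ldC.  An optimized version would replace the loops
// with SIMD intrinsics or assembly for one specific (MR, NR).
template <typename T, std::size_t MR, std::size_t NR>
void ugemm_ref(std::size_t k, T alpha, const T *A, const T *B,
               T beta, T *C, std::size_t ldC)
{
    T AB[MR*NR] = {};                        // accumulator, column-major
    for (std::size_t l = 0; l < k; ++l) {    // rank-1 update per step
        for (std::size_t j = 0; j < NR; ++j) {
            for (std::size_t i = 0; i < MR; ++i) {
                AB[i + j*MR] += A[i + l*MR] * B[j + l*NR];
            }
        }
    }
    for (std::size_t j = 0; j < NR; ++j) {   // write back: beta*C + alpha*AB
        for (std::size_t i = 0; i < MR; ++i) {
            C[i*ldC + j] = beta*C[i*ldC + j] + alpha*AB[i + j*MR];
        }
    }
}
```

Everything above the micro-kernel (packing, cache blocking, the macro-kernel loops) can then stay generic header-only C++.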
But anyway, the first step would be to specify a common *low-level* C++ standard for BLAS. In my opinion, that
should be independent of any matrix classes. For my C++ BLAS implementation the GEMM signature is
this:
https://github.com/michael-lehn/ulmBLAS-core/blob/master/cxxblas/level3/gemm.h
I know this really looks primitive. But you have experience with bindings to BLAS: at some point you have to go down
to details like pointers and matrix formats. I also don't use enums for "Trans", "NoTrans", "Lower", ... Every C++
matrix library has its own types for that, so this saves writing traits for converting them. In order to be successful,
this has to be attractive for other widely used C++ libraries like MTL4.
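As a sketch of the idea (my paraphrase of such a class-free interface, not necessarily the exact signature behind the link above): each operand is just a pointer plus a row stride and a column stride, so a transposed operand is expressed by swapping its strides instead of passing a Trans enum, and any matrix class can bind to it without conversion traits:

```cpp
#include <cstddef>

// Class-free GEMM: C = beta*C + alpha*A*B.  Each matrix is described by a
// raw pointer and two strides (increment per row, increment per column).
// A row-major matrix has incRow = ld, incCol = 1; its transpose is the
// same pointer with the two strides swapped -- no Trans enum needed.
// Naive reference loops, only to illustrate the interface.
template <typename IndexType, typename T>
void gemm(IndexType m, IndexType n, IndexType k,
          T alpha,
          const T *A, IndexType incRowA, IndexType incColA,
          const T *B, IndexType incRowB, IndexType incColB,
          T beta,
          T *C, IndexType incRowC, IndexType incColC)
{
    for (IndexType i = 0; i < m; ++i) {
        for (IndexType j = 0; j < n; ++j) {
            T sum = T(0);
            for (IndexType l = 0; l < k; ++l) {
                sum += A[i*incRowA + l*incColA] * B[l*incRowB + j*incColB];
            }
            C[i*incRowC + j*incColC] = beta*C[i*incRowC + j*incColC] + alpha*sum;
        }
    }
}
```

A column-major caller passes incRow = 1, incCol = ld; a row-major caller swaps the two. That is the whole "binding": no enums, no traits, no matrix class.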
And of course, for any C++ BLAS implementation we can offer a C and Fortran interface, instead of the other way around ...