First of all, I don’t want to take undue credit, so I want to point out that this is not my algorithm. It is based on

http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf


and 

http://www.cs.utexas.edu/users/flame/pubs/blis1_toms_rev3.pdf

and, of course,


https://github.com/flame/blis