Subject: [ublas] [PATCH 0/3] boost::ublas Improving the performance of dense matrix multiplication
From: Imre Palik (imre_palik_at_[hidden])
Date: 2016-02-29 02:46:35


This series pulls Michael Lehn's gemm implementation into ublas.
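
For readers unfamiliar with the idea, here is a minimal sketch of cache-blocked matrix multiplication, the general technique behind block sizes such as those the series adds in detail/block_sizes.hpp. It is not the code from the patches: the names blocked_gemm, naive_gemm and kBlock are hypothetical, the block size is an untuned placeholder, and packing and the register-level micro-kernel are left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical block size, chosen only for illustration. The tuned values
// used by the patch series are not reproduced here.
constexpr std::size_t kBlock = 64;

// C += A * B for row-major n x n matrices stored in flat vectors.
// The three outer loops walk over tiles so that a tile of A, B and C
// is reused while it is still likely to be in cache.
void blocked_gemm(std::size_t n,
                  const std::vector<double>& a,
                  const std::vector<double>& b,
                  std::vector<double>& c)
{
    for (std::size_t ii = 0; ii < n; ii += kBlock) {
        const std::size_t i_end = std::min(ii + kBlock, n);
        for (std::size_t kk = 0; kk < n; kk += kBlock) {
            const std::size_t k_end = std::min(kk + kBlock, n);
            for (std::size_t jj = 0; jj < n; jj += kBlock) {
                const std::size_t j_end = std::min(jj + kBlock, n);
                // Multiply one tile of A by one tile of B.
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double aik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}

// Unblocked reference with the same semantics, C += A * B.
void naive_gemm(std::size_t n,
                const std::vector<double>& a,
                const std::vector<double>& b,
                std::vector<double>& c)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

Both functions accumulate into c, so c should be zero-initialised to obtain a plain product. The min() calls handle sizes that are not a multiple of kBlock.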

Performance on Haswell as per the bench1 test in ublas:

before:

DOUBLE, 3
bench_3
prod (matrix, matrix)
C array
elapsed: 0.4 s, 321.865 Mflops
c_matrix safe
elapsed: 3.35 s, 38.4317 Mflops
c_matrix fast
elapsed: 2.84 s, 45.3331 Mflops
matrix<unbounded_array> safe
elapsed: 6.28 s, 20.501 Mflops
matrix<unbounded_array> fast
elapsed: 5.87 s, 21.9329 Mflops
DOUBLE, 10
bench_3
prod (matrix, matrix)
C array
elapsed: 0.44 s, 411.814 Mflops
c_matrix safe
elapsed: 2.49 s, 72.7703 Mflops
c_matrix fast
elapsed: 2.3 s, 78.7818 Mflops
matrix<unbounded_array> safe
elapsed: 5.43 s, 33.3698 Mflops
matrix<unbounded_array> fast
elapsed: 5.42 s, 33.4314 Mflops
DOUBLE, 30
bench_3
prod (matrix, matrix)
C array
elapsed: 3.41 s, 445.514 Mflops
c_matrix safe
elapsed: 16.72 s, 90.8614 Mflops
c_matrix fast
elapsed: 16.23 s, 93.6046 Mflops
matrix<unbounded_array> safe
elapsed: 40.55 s, 37.4649 Mflops
matrix<unbounded_array> fast
elapsed: 40.52 s, 37.4927 Mflops
DOUBLE, 100
bench_3
prod (matrix, matrix)
C array
elapsed: 5.07 s, 374.322 Mflops
c_matrix safe
elapsed: 19.2 s, 98.8444 Mflops
c_matrix fast
elapsed: 19.06 s, 99.5704 Mflops
matrix<unbounded_array> safe
elapsed: 48.54 s, 39.0979 Mflops
matrix<unbounded_array> fast
elapsed: 48.54 s, 39.0979 Mflops
DOUBLE, 300
bench_3
prod (matrix, matrix)
C array
elapsed: 3.23 s, 477.516 Mflops
c_matrix safe
elapsed: 15.92 s, 96.883 Mflops
c_matrix fast
elapsed: 15.91 s, 96.9439 Mflops
matrix<unbounded_array> safe
elapsed: 39.6 s, 38.9489 Mflops
matrix<unbounded_array> fast
elapsed: 39.57 s, 38.9785 Mflops
DOUBLE, 1000
bench_3
prod (matrix, matrix)
C array
elapsed: 4.85 s, 393.071 Mflops
c_matrix safe
elapsed: 19.9 s, 95.7987 Mflops
c_matrix fast
elapsed: 19.8 s, 96.2826 Mflops
matrix<unbounded_array> safe
elapsed: 49.16 s, 38.7794 Mflops
matrix<unbounded_array> fast
elapsed: 49.27 s, 38.6928 Mflops

after:

DOUBLE, 3
bench_3
prod (matrix, matrix)
C array
elapsed: 0.37 s, 347.962 Mflops
c_matrix safe
elapsed: 4.52 s, 28.4836 Mflops
c_matrix fast
elapsed: 4.47 s, 28.8022 Mflops
matrix<unbounded_array> safe
elapsed: 6.37 s, 20.2113 Mflops
matrix<unbounded_array> fast
elapsed: 7.75 s, 16.6124 Mflops
DOUBLE, 10
bench_3
prod (matrix, matrix)
C array
elapsed: 0.44 s, 411.814 Mflops
c_matrix safe
elapsed: 2.79 s, 64.9456 Mflops
c_matrix fast
elapsed: 2.79 s, 64.9456 Mflops
matrix<unbounded_array> safe
elapsed: 5.44 s, 33.3085 Mflops
matrix<unbounded_array> fast
elapsed: 5.89 s, 30.7637 Mflops
DOUBLE, 30
bench_3
prod (matrix, matrix)
C array
elapsed: 3.44 s, 441.629 Mflops
c_matrix safe
elapsed: 4.01 s, 378.854 Mflops
c_matrix fast
elapsed: 4.02 s, 377.911 Mflops
matrix<unbounded_array> safe
elapsed: 4.12 s, 368.739 Mflops
matrix<unbounded_array> fast
elapsed: 5.32 s, 285.565 Mflops
DOUBLE, 100
bench_3
prod (matrix, matrix)
C array
elapsed: 5.05 s, 375.804 Mflops
c_matrix safe
elapsed: 3.05 s, 622.233 Mflops
c_matrix fast
elapsed: 3.05 s, 622.233 Mflops
matrix<unbounded_array> safe
elapsed: 3.09 s, 614.179 Mflops
matrix<unbounded_array> fast
elapsed: 3.54 s, 536.105 Mflops
DOUBLE, 300
bench_3
prod (matrix, matrix)
C array
elapsed: 3.23 s, 477.516 Mflops
c_matrix safe
elapsed: 2.05 s, 752.379 Mflops
c_matrix fast
elapsed: 2.04 s, 756.067 Mflops
matrix<unbounded_array> safe
elapsed: 2.05 s, 752.379 Mflops
matrix<unbounded_array> fast
elapsed: 2.17 s, 710.773 Mflops
DOUBLE, 1000
bench_3
prod (matrix, matrix)
C array
elapsed: 4.83 s, 394.699 Mflops
c_matrix safe
elapsed: 2.37 s, 804.386 Mflops
c_matrix fast
elapsed: 2.37 s, 804.386 Mflops
matrix<unbounded_array> safe
elapsed: 2.39 s, 797.655 Mflops
matrix<unbounded_array> fast
elapsed: 2.44 s, 781.309 Mflops
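
To make the before/after comparison easier to read, the sketch below computes the speedup as the ratio of the reported Mflops figures. The values are the matrix<unbounded_array> safe rows copied from the two listings above; the names Sample, unbounded_array_safe_samples and speedup are hypothetical helpers, not part of ublas or its benchmarks.

```cpp
#include <cstddef>
#include <vector>

// One benchmark size with the Mflops reported before and after the series.
struct Sample {
    std::size_t n;         // matrix dimension (the number after "DOUBLE,")
    double before_mflops;  // from the "before" listing
    double after_mflops;   // from the "after" listing
};

// matrix<unbounded_array> safe rows, copied from the listings above.
inline std::vector<Sample> unbounded_array_safe_samples()
{
    return {
        {3,    20.501,  20.2113},
        {10,   33.3698, 33.3085},
        {30,   37.4649, 368.739},
        {100,  39.0979, 614.179},
        {300,  38.9489, 752.379},
        {1000, 38.7794, 797.655},
    };
}

// Ratio above 1 means the patched code is faster, below 1 means slower.
inline double speedup(const Sample& s)
{
    return s.after_mflops / s.before_mflops;
}
```

Applied to these rows, the ratio is slightly below 1 for sizes 3 and 10 and grows to a little over 20 at size 1000. The c_matrix rows show a clearer regression at size 3, from 38.4317 to 28.4836 Mflops.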

Imre Palik (3):
  ublas: improved dense matrix multiplication performance
  boost::ublas: gcc support for optimal matrix multiplication
  boost::ublas increasing the range of BLAS level 3 benchmarks

 benchmarks/bench1/bench1.cpp                       |  14 +-
 benchmarks/bench1/bench13.cpp                      |   8 +
 benchmarks/bench3/bench3.cpp                       |  14 +-
 benchmarks/bench3/bench33.cpp                      |   8 +
 include/boost/numeric/ublas/detail/block_sizes.hpp |  78 +++++
 include/boost/numeric/ublas/detail/gemm.hpp        | 340 +++++++++++++++++++++
 include/boost/numeric/ublas/detail/vector.hpp      |  27 ++
 include/boost/numeric/ublas/matrix_expression.hpp  | 140 ++++++++-
 include/boost/numeric/ublas/operation.hpp          | 155 +++++++++-
 9 files changed, 757 insertions(+), 27 deletions(-)
 create mode 100644 include/boost/numeric/ublas/detail/block_sizes.hpp
 create mode 100644 include/boost/numeric/ublas/detail/gemm.hpp
 create mode 100644 include/boost/numeric/ublas/detail/vector.hpp

-- 
1.9.1