From: Sergei Marchenko (serge_v_m_at_[hidden])
Date: 2021-01-11 21:56:36
> I will need to experiment more with both of these libraries to get a better sense of which one is the best fit. The preliminary idea is to split responsibilities between NN and uBlas/Boost.Compute such that the NN library defines an interface and familiar abstractions in the NN domain, and uBlas/Boost.Compute are used as the core computation engine. If this idea works out as I hope it will, we can put aside the discussion about hardware support, because it will come with the underlying compute engine, and we can focus more on the convenience of the interface and abstractions that an NN library can provide for easier use of ML elements.
Boost.Compute + OpenCL extensions to leverage the hardware definitely look like the right path to take and would be a useful addition to this library. It would require a careful selection of OpenCL kernels for optimal speed, as was obvious from a simple test with different implementations of Matrix * Vector that I ran on the few OpenCL devices available on my computer. To my surprise, the plain C++ version outperformed my GPU, and I got a nice speedup (roughly 3x over plain C++) from the OpenCL implementation on the CPU with a simplistic kernel. I must have a very old and slow GPU.
These are the raw test results for a 4096 x 4096 matrix, in case anybody is interested.
Best regards,
Sergei Marchenko
OpenCL Platform: 'ATI Stream' (vendor: Advanced Micro Devices, Inc.)
Devices:
Device: ' Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz ' (version: 2.0 ) (type: CPU)
Device: 'Toucan ' (version: CAL 1.4.1848 ) (type: GPU)
Extensions:
cl_khr_icd
cl_amd_event_callback
cl_khr_d3d10_sharing
OpenCL Platform: 'AMD Accelerated Parallel Processing' (vendor: Advanced Micro Devices, Inc.)
Devices:
Device: 'Turks' (version: 1800.11 (VM)) (type: GPU)
Device: ' Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz' (version: 1800.11 (sse2,avx)) (type: CPU)
Extensions:
cl_khr_icd
cl_khr_d3d10_sharing
cl_khr_d3d11_sharing
cl_khr_dx9_media_sharing
cl_amd_event_callback
cl_amd_offline_devices
Test Device: ' Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz ' (version: 2.0 ) (type: CPU)
Testing matrix * vector (map+reduce kernels):
Map Elapsed: 45999200 ns
Map BandWidth: 1.50486 GB/s
Reduce Elapsed: 170500 ns
Reduce BandWidth: 12.3961 GB/s
Elapsed: 54297900 ns
BandWidth: 2.47188 GB/s
Testing matrix * vector (naive kernel):
Elapsed: 5341900 ns
BandWidth: 12.5689 GB/s
Testing matrix * vector (Boost.Compute algorithms):
Elapsed: 724216500 ns
BandWidth: 0.185351 GB/s
Testing matrix * vector (plain C++):
Elapsed: 17725800 ns
BandWidth: 3.78779 GB/s
Test Device: 'Toucan ' (version: CAL 1.4.1848 ) (type: GPU)
Testing matrix * vector (map+reduce kernels):
Map Elapsed: 490535376 ns
Map BandWidth: 0.141116 GB/s
Reduce Elapsed: 2236373 ns
Reduce BandWidth: 0.945073 GB/s
Elapsed: 602027100 ns
BandWidth: 0.222943 GB/s
Testing matrix * vector (naive kernel):
Elapsed: 170503700 ns
BandWidth: 0.393784 GB/s
Testing matrix * vector (Boost.Compute algorithms):
Elapsed: 6837179400 ns
BandWidth: 0.019633 GB/s
Testing matrix * vector (plain C++):
Elapsed: 17901600 ns
BandWidth: 3.75059 GB/s
Test Device: 'Turks' (version: 1800.11 (VM)) (type: GPU)
Testing matrix * vector (map+reduce kernels):
Map Elapsed: 222894000 ns
Map BandWidth: 0.310562 GB/s
Reduce Elapsed: 5166778 ns
Reduce BandWidth: 0.409063 GB/s
Elapsed: 248867100 ns
BandWidth: 0.539315 GB/s
Testing matrix * vector (naive kernel):
Elapsed: 156637700 ns
BandWidth: 0.428643 GB/s
Testing matrix * vector (Boost.Compute algorithms):
Elapsed: 2145102000 ns
BandWidth: 0.062577 GB/s
Testing matrix * vector (plain C++):
Elapsed: 17918300 ns
BandWidth: 3.7471 GB/s
Test Device: ' Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz' (version: 1800.11 (sse2,avx)) (type: CPU)
Testing matrix * vector (map+reduce kernels):
Map Elapsed: 37620700 ns
Map BandWidth: 1.84001 GB/s
Reduce Elapsed: 245500 ns
Reduce BandWidth: 8.60911 GB/s
Elapsed: 43919500 ns
BandWidth: 3.05599 GB/s
Testing matrix * vector (naive kernel):
Elapsed: 5410200 ns
BandWidth: 12.4102 GB/s
Testing matrix * vector (Boost.Compute algorithms):
Elapsed: 641987200 ns
BandWidth: 0.209092 GB/s
Testing matrix * vector (plain C++):
Elapsed: 17944000 ns
BandWidth: 3.74173 GB/s
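
As a cross-check on the figures above, here is a minimal sketch of a plain C++ Matrix * Vector together with the bandwidth arithmetic. It is not the code used for the test. The function names are hypothetical, and the byte accounting (matrix + input vector + output vector, 4-byte float elements, 1 GB = 1e9 bytes) is an assumption inferred from the numbers: it reproduces the "plain C++" and "naive kernel" BandWidth lines, but not the map+reduce ones, which appear to count bytes differently.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Row-major dense matrix * vector: y[r] = sum over c of a[r * cols + c] * x[c].
std::vector<float> mat_vec(const std::vector<float>& a,
                           const std::vector<float>& x,
                           std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols && x.size() == cols);
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        const float* row = a.data() + r * cols;
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            acc += row[c] * x[c];
        }
        y[r] = acc;
    }
    return y;
}

// Assumed accounting: the matrix and input vector are read once and the
// output vector is written once, all as 4-byte floats.
double bytes_moved(std::size_t rows, std::size_t cols) {
    return static_cast<double>(rows * cols + cols + rows) * sizeof(float);
}

// Bytes per nanosecond is numerically equal to GB/s when 1 GB = 1e9 bytes.
double bandwidth_gb_per_s(double bytes, double elapsed_ns) {
    return bytes / elapsed_ns;
}

// Wall-clock time of a single mat_vec call in nanoseconds.
double time_mat_vec_ns(const std::vector<float>& a,
                       const std::vector<float>& x,
                       std::size_t rows, std::size_t cols) {
    const auto start = std::chrono::steady_clock::now();
    const std::vector<float> y = mat_vec(a, x, rows, cols);
    const auto stop = std::chrono::steady_clock::now();
    assert(y.size() == rows);  // also keeps the result alive until timing ends
    return std::chrono::duration<double, std::nano>(stop - start).count();
}
```

Under that assumption, a 4096 x 4096 run moves (4096 * 4096 + 2 * 4096) * 4 = 67141632 bytes, and dividing by the 17725800 ns reported for plain C++ gives about 3.7878 GB/s, which matches the log. A single timed call like this is noisy; averaging several runs would be more robust.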