
Subject: Re: [boost] interest in structure of arrays container?
From: Andreas Schäfer (gentryx_at_[hidden])
Date: 2016-10-26 04:23:02


On 22:13 Tue 25 Oct , Michael Marcin wrote:
> On 10/25/2016 12:22 PM, Larry Evans wrote:
> >
> > Hmmm. I didn't realize you'd have to run the benchmark
> > several times to get stable results. I guess that reflects
> > my ignorance of how benchmarks should be run.
>
> The code was just a quick example hacked up to show large difference
> between different techniques.
>
> If you want to compare similar techniques you'll need a more robust
> benchmark.
>
> It would be easy to convert it to use:
> https://github.com/google/benchmark
>
> Which is quite good.

When doing performance measurements you have to take into account the
most common sources of noise:

1. Other processes might eat up CPU time or memory bandwidth.

2. The OS might decide to move your benchmark from one core to
   another, so you're losing all L1+L2 cache entries. (Solution:
   thread pinning)

3. Thermal conditions and thermal inertia may affect if/when the CPU
   increases its clock speed. (Solution: either disable turbo mode or
   run the benchmark long enough to even out the thermal
   fluctuations.)

AFAIK Google Benchmark doesn't do thread pinning and cannot control
turbo mode. LIKWID ( https://github.com/RRZE-HPC/likwid ) can be used
to set clock frequencies and pin threads, and it can read the CPU's
performance counters. It might be a good idea to use Google Benchmark
and LIKWID together.

> > Could you explain how running a couple of times achieves
> > stable results (actually, on some occasions, I've run the
> > benchmark and got results completely unexpected, I suspect
> > it was because some application daemon was stealing cycles
> > from the benchmark, leading to the unexpected results).
> >
> >> Interestingly your SSE code is ~13% faster than the
> >> LibFlatArray code for large particle counts.
> >
> > Actually, the SSE code was the OP's.
> >
>
> Actually it originates from:
>
> https://software.intel.com/en-us/articles/creating-a-particle-system-with-streaming-simd-extensions

Ah, thanks for the info.

> > From the above, the LibFlatArray and SSE methods are the
> > fastest. I'd guess that a new "SoA block SSE" method, which
> > uses the _mm_* methods, would narrow the difference. I'll
> > try to figure out how to do that. I notice:
> >
> > #include <mmintrin.h>
> >
> > doesn't produce a compile error; however, that #include
> > doesn't have the _mm_add_ps used here:
> >
> > https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L621
> >
> >
> > Do you know of some package I could install on my ubuntu OS
> > that makes those SSE functions, such as _mm_add_ps,
> > available?
> >
> > [snip]
>
> If you're using gcc I think the header <xmmintrin.h>

The header to include doesn't depend on the compiler, but on the CPU
model. Or rather: on the vector ISA supported by the CPU:

  http://stackoverflow.com/questions/11228855/header-files-for-simd-intrinsics

Cheers
-Andreas

-- 
==========================================================
Andreas Schäfer
HPC and Supercomputing
Institute for Multiscale Simulation
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-20866
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==========================================================
(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your
signature to help him gain world domination!



Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk