Subject: Re: [boost] interest in structure of arrays container?
From: Andreas Schäfer (gentryx_at_[hidden])
Date: 2016-10-26 04:33:28

On 12:22 Tue 25 Oct , Larry Evans wrote:
> On 10/25/2016 01:41 AM, Andreas Schäfer wrote:
> > On 07:50 Fri 21 Oct , Larry Evans wrote:
> >> I can't imagine how anything could be faster
> >> than the soa_emitter_static_t because it uses a tuple of
> >> std::array<T,particle_count>. I'd guess that the
> >> soa_emitter_block_t is only faster by luck (maybe during
> >> the soa_emitter_block_t run, my machine was not as busy on some other
> >> stuff).
> >
> > I think the reason why the different implementation techniques are so
> > close is that the particle model is memory bound (i.e. it's moving a
> > lot of data while each particle update involves relatively few
> > calculations).
> >
> > The difference becomes larger if you're using only a few particles:
> > then all particles sit in the upper levels of the cache and the CPU
> > doesn't have to wait as much for the data. It would also be worthwhile
> > to try a more complex particle model (e.g. by adding interaction
> > between the particles). With increased computational intensity
> > (floating point operations per byte moved) the delta of the different
> > strategies should increase much more.
> Thanks for the explanation. The latest version of the
> benchmark:
> d6ee370606f7f167dedb93e174459c6c7c4d8c19
> reports the relative difference of the times:

Yeah, I saw that change when I merged upstream. TBH, I don't think
this is helpful, as the relative difference adds noise from one
measurement to all other measurements. It complicates comparison
between multiple runs of the benchmark and prevents conversion into
other metrics (e.g. GFLOPS).

> So, based on what you say above, I guess when
> particle_count:
> increases to a point where the cache is overflowed, the
> relative differences between methods should show a sharp
> difference?

The difference between the methods shrinks as more and more particles
are used, because memory bandwidth then becomes the limiting factor.
The transition between "in cache" and "in memory" isn't sharp, but
rather smooth, as the L3 cache will still retain some data even if the
total data set is too large to fit into L3. If you vary the number of
particles, you should be able to observe different performance levels
based on the cache level the data set fits into (32kB for L1, 256kB
for L2 (on Intel), some MB for L3).
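
As a rough sketch of how one might pick particle counts for such a
sweep: the function below classifies a working set by the cache level
it nominally fits into. The name level_for(), the bytes_per_particle
parameter, and the L3 size are assumptions for illustration. Only the
32kB and 256kB figures come from the text above. Since the real
transition is smooth, treat the result as a hint, not a hard boundary.

```cpp
#include <cstddef>

enum class MemoryLevel { L1, L2, L3, Memory };

// Hypothetical helper: returns the smallest level whose nominal capacity
// holds the whole working set. l3_bytes must be supplied by the caller
// because L3 sizes vary between CPUs.
MemoryLevel level_for(std::size_t particle_count,
                      std::size_t bytes_per_particle,
                      std::size_t l3_bytes)
{
    const std::size_t l1_bytes = 32 * 1024;   // 32kB L1, as quoted above
    const std::size_t l2_bytes = 256 * 1024;  // 256kB L2 (on Intel)
    const std::size_t working_set = particle_count * bytes_per_particle;

    if (working_set <= l1_bytes) return MemoryLevel::L1;
    if (working_set <= l2_bytes) return MemoryLevel::L2;
    if (working_set <= l3_bytes) return MemoryLevel::L3;
    return MemoryLevel::Memory;
}
```

Running the benchmark with one particle count per level (and a few in
between) should make the different performance plateaus visible.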

> > I've added an implementation of the benchmark based on LibFlatArray's
> > SoA containers and expression templates[1]. While working on the
> > benchmark, I realized that the vector types ("short_vec") in
> > LibFlatArray were lacking some desirable operations (e.g. masked
> > move), so to reproduce my results you'll have to use the trunk from
> > [2]. I'm very happy that you wrote this benchmark because it's a
> > valuable test bed for performance, programmability, and functionality.
> > Thanks!
> You're welcome. Much of the credit goes to the OP, as
> acknowledged, indirectly, here:

Thanks. Sorry for the confusion. :-)

> > I'll have to take a look at the assembly to figure out why
> > that is.
> Oh, I bet that will be fun ;)

I hope so. Hope dies last. ;-)


Andreas Schäfer
HPC and Supercomputing
Institute for Multiscale Simulation
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-20866
PGP/GPG key via keyserver
