Subject: Re: [boost] interest in structure of arrays container?
From: Larry Evans (cppljevans_at_[hidden])
Date: 2016-10-25 13:22:59


On 10/25/2016 01:41 AM, Andreas Schäfer wrote:
> On 07:50 Fri 21 Oct , Larry Evans wrote:
>> I can't imagine how anything could be faster
>> than the soa_emitter_static_t because it uses a tuple of
>> std::array<T,particle_count>. I'd guess that the
>> soa_emitter_block_t is only faster by luck (maybe during
>> the soa_emitter_block_t run, my machine was not as busy on some other
>> stuff).
>
> I think the reason why the different implementation techniques are so
> close is that the particle model is memory bound (i.e. it's moving a
> lot of data while each particle update involves relatively few
> calculations).
>
> The difference becomes larger if you're using only a few particles:
> then all particles sit in the upper levels of the cache and the CPU
> doesn't have to wait as much for the data. It would also be worthwhile
> to try a more complex particle model (e.g. by adding interaction
> between the particles). With increased computational intensity
> (floating point operations per byte moved) the delta of the different
> strategies should increase much more.

Thanks for the explanation. The latest version of the
benchmark:

d6ee370606f7f167dedb93e174459c6c7c4d8c19

reports the relative difference of the times:

https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L823

So, based on what you say above, I guess when
particle_count:

https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L135

grows past the point where the data no longer fits in the
cache, the relative differences between the methods should
change sharply?

>
> I've added an implementation of the benchmark based on LibFlatArray's
> SoA containers and expression templates[1]. While working on the
> benchmark, I realized that the vector types ("short_vec") in
> LibFlatArray were lacking some desirable operations (e.g. masked
> move), so to reproduce my results you'll have to use the trunk from
> [2]. I'm very happy that you wrote this benchmark because it's a
> valuable test bed for performance, programmability, and functionality.
> Thanks!

You're welcome. Much of the credit goes to the OP, as
acknowledged, indirectly, here:

https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L6

>
> One key contribution is that the LibFlatArray-based kernels will
> automatically be vectorized without the user having to touch
> intrinsics (which automatically tie your code to a specific platform).
> LibFlatArray supports SSE, AVX, AVX512 (not yet available in consumer
> products), ARM NEON...
>
> I've re-run the benchmark a couple of times on my Intel Core i7-6700HQ
> (Skylake quad-core) to get stable results.

Hmmm. I didn't realize you'd have to run the benchmark
several times to get stable results. I guess that reflects
my ignorance of how benchmarks should be run.

Could you explain how running it a couple of times achieves
stable results? (On some occasions I've run the benchmark
and got completely unexpected results; I suspect some
application daemon was stealing cycles from the benchmark.)

> Interestingly your SSE code is ~13% faster than the
> LibFlatArray code for large particle counts.

Actually, the SSE code was the OP's.

As intimated above, using the latest version of the
benchmark should make this % difference more apparent. For
example, the output looks like this:

particle_count=1,024
frames=1,000
minimum duration=0.0369697

comparitive performance table:

method    rel_duration
________  ______________
SoA       0.902566
Flat      0.907562
Block     0.963046
AoS       1
StdArray  1.15868
LFA       undefined
SSE       undefined
SSE_opt   undefined

The above was done with compiler optimization flag -O0. The
results change dramatically with -O2 or -O3.

> I'll have to take a look at the assembly to figure out why
> that is.

Oh, I bet that will be fun ;)

> (As a library developer having such a test case is incredibly
> valuable, so thanks again!) For fewer particles the LibFlatArray
> kernel is ~31% faster. I assume that this delta would increase with a
> higher computational intensity as it's using AVX. On a SSE-only CPU
> the LibFlatArray code might be a little slower than the hand-optimized
> SSE code.
>
>
> particle_count=1.000.000
> AoS in 9,21448 seconds
> SoA in 5,87921 seconds
> SoA flat in 5,81664 seconds
> SoA Static in 7,10225 seconds
> SoA block in 6,16696 seconds
> LibFlatArray SoA in 5,31733 seconds
> SoA SSE in 4,79973 seconds
> SoA SSE opt in 4,70757 seconds
>
> particle_count=1.024
> AoS in 6,10074 seconds
> SoA in 6,6032 seconds
> SoA flat in 6,70765 seconds
> SoA Static in 6,74453 seconds
> SoA block in 6,54649 seconds
> LibFlatArray SoA in 2,10663 seconds
> SoA SSE in 3,53452 seconds
> SoA SSE opt in 2,76819 seconds
>

From the above, the LibFlatArray and SSE methods are the
fastest. I'd guess that a new "SoA block SSE" method, which
uses the _mm_* intrinsics, would narrow the difference. I'll
try to figure out how to do that. I notice:

   #include <mmintrin.h>

doesn't produce a compile error; however, that header
doesn't declare the _mm_add_ps used here:

https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L621

Do you know of some package I could install on my Ubuntu
system that makes those SSE functions, such as _mm_add_ps,
available?

[snip]

-regards,
Larry

