Subject: Re: [boost] interest in structure of arrays container?
From: Michael Marcin (mike.marcin_at_[hidden])
Date: 2016-10-17 04:29:36


On 10/16/2016 10:49 PM, degski wrote:
>> This is still a toy example but it's closer to something real.
>>
>
> Yes, but is 1M particles common?

Depends on the game.
To be fair, depending on where your bottlenecks are, you might move code
like this to compute shaders instead.

>
>> AoS in 6.54421 seconds
>> SoA in 5.91915 seconds
>> SoA SSE in 3.58603 seconds
>
>
> 1M particles on my Ci3 5005U 2.0GHZ/AVX2/4GB laptop / WIN10 / Clang/LLVM
> 4.0:
>
> AoS in 14.7198 seconds
> SoA in 13.5969 seconds
> SoA SSE in 8.78095 seconds
>
> I've run this with a count of 25'000 and it shows something(s) interesting:
>
> AoS in 0.274145 seconds
> SoA in 0.312875 seconds
> SoA SSE in 0.0768812 seconds
>
> 1. SoA slower than AoS.
> 2. SoA SSE way faster (relatively) than either SoA and AoS.
>
> You've definitely made your case, when using SSE. I'll have a rethink.
>

Indeed, the code generated for SoA here is much worse than for AoS.

The AoS update is roughly 40 assembly instructions.
The SoA update is roughly 200 assembly instructions.

A lot of this is probably due to the soa_emitter_t implementation being
suboptimal.

Also, sizeof( particle_t ) = 68 bytes
68 * 25'000 = 1.7 megs

Your CPU has a 3 megabyte L3 cache, so the entire data structure can
stay in fast memory.

All versions (aos, soa, soa_sse) have access patterns that are very
friendly to the memory prefetcher, so L2 and L1 cache size should not
affect the results.

So with no main memory access, the update with the much better code (aos)
wins.

If you bump your count up to 50'000 (3.4 megs) you might see SoA pull
ahead again; at 100'000 (6.8 megs) you should definitely see it.
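
The working-set arithmetic above can be checked at compile time. This is
only a rough sketch: it assumes the 68 byte particle size, treats the 3
megabyte L3 as 3'000'000 bytes, and ignores everything else that competes
for the cache. The names working_set_bytes and fits_in_l3 are made up for
illustration.

```cpp
#include <cstddef>

// Assumed figures taken from the discussion, not measured here.
constexpr std::size_t particle_bytes = 68;      // sizeof( particle_t )
constexpr std::size_t l3_bytes       = 3000000; // 3 megs, decimal

// Total bytes occupied by `count` AoS particles.
constexpr std::size_t working_set_bytes(std::size_t count) {
    return count * particle_bytes;
}

// True if the whole AoS array could sit in L3 (ignoring other data).
constexpr bool fits_in_l3(std::size_t count) {
    return working_set_bytes(count) <= l3_bytes;
}

static_assert(working_set_bytes(25000) == 1700000, "25'000 particles = 1.7 megs");
static_assert(fits_in_l3(25000),   "25'000 particles fit in a 3 meg L3");
static_assert(!fits_in_l3(50000),  "50'000 particles (3.4 megs) do not fit");
static_assert(!fits_in_l3(100000), "100'000 particles (6.8 megs) do not fit");
```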

Alternatively you could add more data to the particles, like say:
struct pad_t {
     char data[64];
};

struct particle_t {
     ...
     pad_t pad;
};

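This additional data won't affect the SoA update at all but should affect
your AoS update. (Rough math: 3 megs / 25k elements = 120 bytes per
element max to fit everything in cache. With the pad, each particle is at
least 68 + 64 = 132 bytes, which is over that limit.)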
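
To illustrate why the padding hurts only AoS, here is a minimal sketch. The
types demo_particle_t and demo_soa_t and their position/velocity fields are
hypothetical stand-ins, not the particle_t or soa_emitter_t from the
benchmark. The point is only that the AoS loop strides over the pad bytes
while the SoA loop never touches the pad array.

```cpp
#include <cstddef>
#include <vector>

struct pad_t {
    char data[64];
};

// Hypothetical AoS element: fields used by update, plus unused padding.
struct demo_particle_t {
    float x, y, z;
    float vx, vy, vz;
    pad_t pad; // never read by update, but still sits between elements
};

// Hypothetical SoA container: one array per field, pad in its own array.
struct demo_soa_t {
    std::vector<float> x, y, z, vx, vy, vz;
    std::vector<pad_t> pad;
};

// AoS update: consecutive elements are sizeof(demo_particle_t) bytes apart,
// so the pad bytes are dragged through the cache along with the hot fields.
inline void update_aos(std::vector<demo_particle_t>& ps, float dt) {
    for (auto& p : ps) {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
    }
}

// SoA update: reads and writes only the six float arrays; s.pad is untouched.
inline void update_soa(demo_soa_t& s, float dt) {
    for (std::size_t i = 0; i < s.x.size(); ++i) {
        s.x[i] += s.vx[i] * dt;
        s.y[i] += s.vy[i] * dt;
        s.z[i] += s.vz[i] * dt;
    }
}

// Bytes stepped over per element by each update loop.
constexpr std::size_t aos_stride_bytes = sizeof(demo_particle_t);
constexpr std::size_t soa_hot_bytes    = 6 * sizeof(float);

static_assert(aos_stride_bytes >= soa_hot_bytes + sizeof(pad_t),
              "AoS stride includes the padding; SoA hot data does not");
```

Growing pad_t grows aos_stride_bytes but leaves soa_hot_bytes unchanged,
which is the effect described above.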

