Boost logo

Boost :

Subject: Re: [boost] interest in structure of arrays container?
From: Michael Marcin (mike.marcin_at_[hidden])
Date: 2016-11-06 15:57:30


On 11/6/2016 10:36 AM, Larry Evans wrote:
> On 10/31/2016 12:00 PM, Michael Marcin wrote:
>> On 10/31/2016 9:14 AM, Larry Evans wrote:
>>>
>>> However, I was still getting the 'double free' error message; hence,
>>> I tried val_grind. It showed a problem in the alive update loop.
>>> When the code was changed to:
>>>
>>> uint64_t *block_ptr = alive.data();
>>> auto e_ptr = energy.data();
>>> for ( size_t i = 0; i < n; ) {
>>> #define REVISED_CODE
>>> #ifdef REVISED_CODE
>>> auto e_i = e_ptr + i;
>>> #endif
>>> uint64_t block = 0;
>>> do {
>>> #ifndef REVISED_CODE
>>> //this code causes valgrind to show errors.
>>> auto e_i = e_ptr + i;
>>> #endif
>>> _mm_store_ps( e_i, _mm_sub_ps( _mm_load_ps( e_i ), t ));
>>> block |=
>>> uint64_t
>>> ( _mm_movemask_ps( _mm_cmple_ps( _mm_load_ps( e_i ),
>>> zero )))
>>> << (i % bits_per_uint64_t)
>>> ;
>>> i += 4;
>>> } while ( i % bits_per_uint64_t != 0 );
>>> *block_ptr++ = block;
>>> }
>>>
>>> valgrind reported no errors; however, when !defined(REVISED_CODE),
>>> valgrind reported:
>>>
>>> valgrind --tool=memcheck
>>> /tmp/build/clangxx3_8_pkg/clang/struct_of_arrays/work/soa_compare.benchmark.optim0.exe
>>>
>>>
>>>
>>> ==7937== Memcheck, a memory error detector
>>> ==7937== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
>>> ==7937== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright
>>> info
>>> ==7937== Command:
>>> /tmp/build/clangxx3_8_pkg/clang/struct_of_arrays/work/soa_compare.benchmark.optim0.exe
>>>
>>>
>>>
>>> ==7937==
>>> COMPILE_OPTIM=0
>>> particle_count=1,000
>>
>>
>> particle_count=1,000 is not a multiple of 64, the optimized energe/alive
>> loop processes 64 particles at a time. I haven't bothered to analyze
>> what the code will do in this case but memory corruption is likely
>>
>> The code to handle a tail (if particle_count % 64 != 0) isn't difficult
>> to add but it is explicitly left out. One of the things you'll often do
>> in a system such as this is fit the data to optimize the algorithm. In
>> the case of a particle system plus or minus 0 to 63 particles is
>> generally unnoticeable.
>>
>> You can address the problem however you like but the simplest solution
>> would be to change your small particle count to 16 * 64 = 1024.
>>
> IIUC, the 64 constraint is because bit_vector stores the bit's in 64
> increments (or bitsof(uint_64_t) where bitsof is defined here:
>
> https://github.com/cppljevans/soa/blob/master/bitsof.hpp
>
> However, with the following SSE_opt alive update loop:
>
>
> https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L961
>
>
> I think the constraint could be relaxed to a multiple of sse_width(=4).
> That's because the additional test:
>
> i_array < n_array
>
> here:
>
>
> https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L978
>
>
> prevents:
>
> auto e_i = e_ptr + i_array;
>
> here:
>
>
> https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L985
>
>
> from accessing outside of the array.
> Does that make sense?
>

The general pattern is to handle full blocks and then separately handle
the final partial block.
This minimizes the amount of work the code must do per iteration.

This code handles multiple of 4 particle counts.
http://codepad.org/iRwgxAkW

Here's an alternative implementation that benchmarks slower on my
machine but is a bit easier to follow IMO.
http://codepad.org/DjqVHac8

P.S. there was a bug in the original sse opt that wouldn't affect perf
testing but _mm_cmple_ps should have been _mm_cmpgt_ps


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk