Boost logo

Boost :

Subject: Re: [boost] interest in structure of arrays container?
From: Michael Marcin (mike.marcin_at_[hidden])
Date: 2016-10-27 02:10:07


On 10/26/2016 1:33 PM, Chris Glover wrote:
>>
>> I guess some optimisation from way yonder (something modern compilers do
>> routinely, even on a Monday morning!)... but more than probable irrelevant
>> nowadays...
>>
>> degski
>>
>
> I might be pessimistic, but I never trust the compiler and generally check
> what's being output. In this case, FWIW, on MSVC2015, the bit-twiddling
> version generates faster code than the mod version -- about 25% faster. I
> didn't test gcc or clang.
>
> Using google benchmark:
>
> Code:
>
> static void AlignedMod(benchmark::State& state)
> {
> while (state.KeepRunning())
> {
> for(int i = state.range_x(); i < 128; i += state.range_y())
> {
> bool aligned = (i % 16) == 0;
> benchmark::DoNotOptimize(aligned);
> }
> }
> }
> BENCHMARK(AlignedMod)->ArgPair(1, 1);
>
> static void AlignedAnd(benchmark::State& state)
> {
> while (state.KeepRunning())
> {
> for(int i = state.range_x(); i < 128; i += state.range_y())
> {
> bool aligned = ((i - 1) & 15) == 0;
> benchmark::DoNotOptimize(aligned);
> }
> }
> }
> BENCHMARK(AlignedAnd)->ArgPair(1, 1);
>
> Generated code of the inner loop:
>
> Mod version:
> mov eax,ebx
> and eax,8000000Fh
> jge AlignedMod+50h
> dec eax
> or eax,0FFFFFFF0h
> inc eax
> test eax,eax
> lea rcx,[aligned]
> sete byte ptr [aligned]
> call 07FF73B84A180h
> add ebx,dword ptr [rdi+1Ch]
> cmp ebx,80h
> jl AlignedMod+40h
>
>
> And version:
> lea eax,[rbx-1]
> test al,0Fh
> lea rcx,[aligned]
> sete byte ptr [aligned]
> call 07FF73B84A180h
> add ebx,dword ptr [rdi+1Ch]
> cmp ebx,80h
> jl AlignedAnd+40h
>
>
> Result:
> Benchmark Time CPU Iterations
> -------------------------------------------------------------------------
> AlignedMod/1/1 204 ns 203 ns 4072727
> AlignedAnd/1/1 153 ns 154 ns 4977778
>

It's very dangerous to draw conclusions from benchmarks like this.

The operation and elapsed time are way too small for the results to be
distinguishable from noise on the system.

AlignedAnd has a random '- 1' in it so it's not even checking if i is 16
byte aligned.

AlignedMod is taking the mod of a signed integer. Make it use an
unsigned integer and some very familiar code gets generated.

         for ( unsigned i = state.range_x(); i < 128; i += state.range_y() )
00007FF7ACBA316C mov ebx,dword ptr [rdi+14h]
00007FF7ACBA316F cmp ebx,80h
00007FF7ACBA3175 jae AlignedMod+12h (07FF7ACBA3132h)
00007FF7ACBA3177 nop word ptr [rax+rax]
         {
             bool aligned = (i % 16) == 0;
00007FF7ACBA3180 test bl,0Fh
             benchmark::DoNotOptimize( aligned );
00007FF7ACBA3183 lea rcx,[aligned]
         {
             bool aligned = (i % 16) == 0;
00007FF7ACBA3188 sete byte ptr [aligned]
             benchmark::DoNotOptimize( aligned );
00007FF7ACBA318D call benchmark::internal::UseCharPointer
(07FF7ACBA1019h)
     {
         for ( unsigned i = state.range_x(); i < 128; i += state.range_y() )
00007FF7ACBA3192 add ebx,dword ptr [rdi+1Ch]
00007FF7ACBA3195 cmp ebx,80h
00007FF7ACBA319B jb AlignedMod+60h (07FF7ACBA3180h)
         }
     }
00007FF7ACBA319D jmp AlignedMod+12h (07FF7ACBA3132h)
}
00007FF7ACBA319F mov rbx,qword ptr [rsp+38h]
00007FF7ACBA31A4 mov rsi,qword ptr [rsp+40h]
00007FF7ACBA31A9 add rsp,20h
00007FF7ACBA31AD pop rdi
00007FF7ACBA31AE ret

I don't agree that it is more obscure. I think it's perfectly natural to
understand that the bottom 4 bits must be zero in order to be multiple
of 16.
I recommend to everyone the book Hacker's Delight which will make you
truly appreciate how wonderful integers are.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk