Re: [boost] [bloom] Benchmarks with Knuth multiplier-based hash production

12 Jun 2025

      El 12/06/2025 a las 13:15, Ivan Matek escribió:
...
On Wed, Jun 11, 2025 at 6:33 PM Joaquin M López Muñoz via Boost 
<boost@lists.boost.org> wrote:
Unlike fast_multiblock*, [multi]block is branchless, so your results
    make sense. I wonder if multiblock<K> should be made branchful as
    the working memory area can get quite large for large values of K.
Can you elaborate why would you want to change multiblock?
To save memory accesses that are unnecessary and slow down rest of the 
program
because they occupy cache or purely to reduce number of instructions 
executed?
Well, both. Consider the following cost model:

* Succesful lookup executes in A*K cycles for some constant factor A.
* Branchless unsuccesful lookup is also A*K, exactly the same as succesful
lookup.
* Branchful unsuccesful lookup when the array is half full executes in
B*(1*0.5 + 2*0.25 + 3*0.125 + ... + K*2^-K-1) cycles . Since the sum
decays exponentially, this is roughly equivalent to some constant C
irrespective of K.

Now, the thing is that for a sufficiently large K, A*K (branchless) is 
going to
be larger than C (branchful). The best approach would be then to have a
hybrid algorithm that branches every L operations. Determining such a
value is an interesting experiment.
...
If that is the case on my CPU it does not seem the matter a lot, i.e. 
changing AVX2
to be branchless does not help or hurt performance, maybe CPU anyway 
decides
to fetch data...
This is what I did to make avx code branchless, did not affect 
performance in a way
I could detect:
 [...]
found = found && check_m256ix2(x[i],hash,8);
This is short circuiting (at least in theory). You may try using bitwise 
AND (&). Also,
try with large values of K: you should reach a point where branchless is 
worse.
...
But back to the multiblock and making it branchful:
With my CPU/compiler(clang is 🚀 for this benchmark compared to gcc) 
it outperforms
fast_multiblock for L2, L3, RAM size filters when it comes to mixed 
lookups, so I am not
certain it needs changing.
Yes, it's hard to decide based on these numbers. For branchless
lookup, mixed lookup obviously takes the same as successful or
unsuccesful lookup. For branchful lookup, a cost model for mixed
lookup would be

P*M*(A*K) + (1-P)*M*C

where P is the percentange of successful lookups and M is an
extra cost factor on top of the non-mixed approach due to branch
mispredictions. Again, an interesting area to investigate. You're giving
me a lot of work :-)

I'd rather release the lib now in time for Boost 1.89 and then postpone
this analysis for 1.90, as it's going to take some work.

Joaquin M Lopez Munoz