
On Wed, Jun 11, 2025 at 6:33 PM Joaquin M López Muñoz via Boost < boost@lists.boost.org> wrote:
Unlike fast_multiblock*, [multi]block is branchless, so your results make sense. I wonder if multiblock<K> should be made branchful as the working memory area can get quite large for large values of K.
Can you elaborate on why you would want to change multiblock?
To save memory accesses that are unnecessary and slow down the rest of the program because they occupy cache, or purely to reduce the number of instructions executed? If it is the latter, on my CPU it does not seem to matter much: changing AVX2 to be branchless neither helps nor hurts performance; maybe the CPU decides to fetch the data anyway.

This is what I did to make the AVX code branchless; it did not affect performance in any way I could detect:

    static BOOST_FORCEINLINE bool check(const value_type& x, boost::uint64_t hash)
    {
        bool found = true;
        for (int i = 0; i < k/8; ++i) {
            found = found && check_m256ix2(x[i], hash, 8);
            hash = detail::mulx64(hash);
        }
        if constexpr (k % 8) {
            found = found && check_m256ix2(x[k/8], hash, k % 8);
        }
        return found;
    }

But back to multiblock and making it branchful: with my CPU/compiler (clang is 🚀 for this benchmark compared to gcc) it outperforms fast_multiblock for L2-, L3- and RAM-sized filters when it comes to mixed lookups, so I am not certain it needs changing. For unsuccessful lookups fast_multiblock wins a little, but I am not sure that unsuccessful lookups will be close to 100% in all use cases. Maybe the best option would be to let users configure whether a filter is branchy or not, but with so many filters already, I am not sure another config option is wise, as it may confuse users.

In any case, here are numbers for two similar filters on my machine, at 1M, 3M and 35M values (I have a lot of L3 cache, so I needed a large number of values to go over L3).
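For clarity, this is roughly the branchful-vs-branchless distinction being discussed, as a self-contained scalar sketch (check_block, the probing scheme and the mixing constant are all illustrative stand-ins, not the library's actual code):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative stand-in for the per-block bit test: true iff all probed
// bits derived from `hash` are set in `block`.
static bool check_block(std::uint64_t block, std::uint64_t hash) {
    std::uint64_t mask = (1ull << (hash & 63)) | (1ull << ((hash >> 6) & 63));
    return (block & mask) == mask;
}

// Branchful variant: early exit on the first failing block, as in
// fast_multiblock*. Cheap when most lookups are unsuccessful.
template <std::size_t K>
bool check_branchful(const std::uint64_t (&x)[K], std::uint64_t hash) {
    for (std::size_t i = 0; i < K; ++i) {
        if (!check_block(x[i], hash)) return false; // early exit
        hash *= 0x9E3779B97F4A7C15ull;              // illustrative re-mixing
    }
    return true;
}

// Branchless variant: always touch all K blocks and AND the results,
// as in [multi]block. Predictable control flow, more memory accesses.
template <std::size_t K>
bool check_branchless(const std::uint64_t (&x)[K], std::uint64_t hash) {
    bool found = true;
    for (std::size_t i = 0; i < K; ++i) {
        found = found & check_block(x[i], hash);    // no short-circuit
        hash *= 0x9E3779B97F4A7C15ull;
    }
    return found;
}
```

Both return the same answer; the trade-off is purely between the early exit (which the branch predictor must guess) and the unconditional accesses to all K blocks.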
Both filters below are filter<int, 1ul, B, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>, with B = multiblock<unsigned long, 11ul> or fast_multiblock64<11ul>.

*CLANG*

*1M/L2*
                             multiblock    fast_multiblock64
    capacity MB              1.90735       1.90735
    FPR                      0.0581        0.0629
    insertion_time           4.30731       4.33433
    successful_lookup_time   4.62453       4.47549
    unsuccessful_lookup_time 4.64252       3.46791
    mixed_lookup_time        4.6812        11.0634

*3M/L3*
                             multiblock    fast_multiblock64
    capacity MB              5.72205       5.72205
    FPR                      0.0609667     0.0632
    insertion_time           7.01126       7.17907
    successful_lookup_time   6.40535       6.13649
    unsuccessful_lookup_time 6.30024       6.17347
    mixed_lookup_time        6.20703       12.0361

*35M/RAM*
                             multiblock    fast_multiblock64
    capacity MB              66.7572       66.7572
    FPR                      0.0635571     0.06368
    insertion_time           19.3996       18.9966
    successful_lookup_time   18.6527       18.729
    unsuccessful_lookup_time 18.1346       16.1566
    mixed_lookup_time        18.2745       21.5246

In case somebody is curious, GCC 14 numbers:

*GCC*

*1M*
                             multiblock    fast_multiblock64
    capacity MB              1.90735       1.90735
    FPR                      0.0581        0.0629
    insertion_time           13.0513       5.68284
    successful_lookup_time   8.06424       5.7248
    unsuccessful_lookup_time 8.09518       4.32124
    mixed_lookup_time        8.21242       11.7601

*3M*
                             multiblock    fast_multiblock64
    capacity MB              5.72205       5.72205
    FPR                      0.0609667     0.0632
    insertion_time           14.0387       6.97644
    successful_lookup_time   9.44954       6.69201
    unsuccessful_lookup_time 9.3228        5.9029
    mixed_lookup_time        9.41488       12.6044

*35M* (in some cases over 50% slower than clang)
                             multiblock    fast_multiblock64
    capacity MB              66.7572       66.7572
    FPR                      0.0635571     0.06368
    insertion_time           26.4349       20.6438
    successful_lookup_time   32.8604       18.1056
    unsuccessful_lookup_time 33.0485       16.2122
    mixed_lookup_time        33.0334       23.2072

What I find quite interesting is that if you benchmark only with clang or only with gcc, you reach totally different conclusions about which filter is faster. Not to mention that before measuring I would have assumed that in the 35M case memory latency dominates, so differences in optimization would not matter. For example, 35M mixed_lookup_time (multiblock vs fast_multiblock64):

    clang: 18.2745 vs 21.5246
    gcc:   33.0334 vs 23.2072

P.S. I am obviously comparing clang and gcc on just one benchmark; please do not extrapolate this to general optimizer performance.
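For reference, a mixed workload in the sense used above is just a shuffled 50/50 split of inserted and absent keys, which makes the early-exit branch of a branchful filter unpredictable. A toy sketch of such a workload against a minimal single-probe-loop Bloom filter (TinyBloom, make_mixed and every constant here are illustrative and have nothing to do with the actual Boost.Bloom API):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Minimal Bloom filter, just enough to drive a mixed-lookup workload.
struct TinyBloom {
    std::vector<std::uint64_t> bits;
    explicit TinyBloom(std::size_t words) : bits(words, 0) {}

    static std::uint64_t mix(std::uint64_t h) {   // cheap 64-bit mixing
        h ^= h >> 33; h *= 0xFF51AFD7ED558CCDull; h ^= h >> 33;
        return h;
    }
    void insert(std::uint64_t key) {
        std::uint64_t h = mix(key);
        for (int k = 0; k < 4; ++k) {             // 4 probes per key
            bits[(h >> 8) % bits.size()] |= 1ull << (h & 63);
            h = mix(h);
        }
    }
    bool check(std::uint64_t key) const {
        std::uint64_t h = mix(key);
        bool found = true;
        for (int k = 0; k < 4; ++k) {             // branchless accumulation
            found = found & ((bits[(h >> 8) % bits.size()] >> (h & 63)) & 1u);
            h = mix(h);
        }
        return found;
    }
};

// Shuffled 50/50 mix: n keys that were inserted (0..n-1) and n that were
// not (1000000..1000000+n-1), in unpredictable order.
static std::vector<std::uint64_t> make_mixed(std::size_t n) {
    std::vector<std::uint64_t> q;
    for (std::uint64_t i = 0; i < n; ++i) q.push_back(i);
    for (std::uint64_t i = 0; i < n; ++i) q.push_back(1000000 + i);
    std::shuffle(q.begin(), q.end(), std::mt19937_64(42));
    return q;
}
```

With such a stream, an early-exit lookup succeeds or fails at essentially random positions, which is presumably why mixed_lookup_time blows up for fast_multiblock64 while the branchless multiblock barely moves.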