Re: [boost] [bloom] Benchmarks with Knuth multiplier-based hash production

13 Jun 2025

On Thu, Jun 12, 2025 at 8:56 PM Joaquin M López Muñoz via Boost
<boost@lists.boost.org> wrote:
...
This is short circuiting (at least in theory). You may try using bitwise
AND (&). Also,
Doh. :) Thank you for catching that.
On my machine, when I do that, the difference is minuscule (noise).
I have now bumped K up to 14, so the numbers are not directly comparable to
the previous ones, but the most important thing is that there is no
difference (for Clang).

Clang
*1M*
filter                   filter<int, 1ul, multiblock<unsigned long, 14ul>,
1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB              1.90735
FPR                      0.0737
insertion_time           4.97109
successful_lookup_time   5.75406
unsuccessful_lookup_time 5.73877
mixed_lookup_time        5.77369
filter                   filter<int, 1ul, fast_multiblock64<14ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB              1.90735
FPR                      0.0736
insertion_time           5.19262
successful_lookup_time   5.8273
unsuccessful_lookup_time 5.81782
mixed_lookup_time        5.81926

*35M*
filter                   filter<int, 1ul, multiblock<unsigned long, 14ul>,
1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB              66.7572
FPR                      0.0752343
insertion_time           19.7033
successful_lookup_time   19.8497
unsuccessful_lookup_time 20.9164
mixed_lookup_time        20.6151
filter                   filter<int, 1ul, fast_multiblock64<14ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB              66.7572
FPR                      0.0748914
insertion_time           19.734
successful_lookup_time   20.3514
unsuccessful_lookup_time 19.5756
mixed_lookup_time        19.375

As before, GCC performance for multiblock is bad, so now the SIMD code
easily wins.

GCC

*1M*
filter                   filter<int, 1ul, multiblock<unsigned long, 14ul>,
1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB              1.90735
FPR                      0.0737
insertion_time           11.846
successful_lookup_time   9.76772
unsuccessful_lookup_time 9.80503
mixed_lookup_time        9.81788
filter                   filter<int, 1ul, fast_multiblock64<14ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB              1.90735
FPR                      0.0736
insertion_time           6.17691
successful_lookup_time   6.43346
unsuccessful_lookup_time 6.45574
mixed_lookup_time        6.4289

*35M*
filter                   filter<int, 1ul, multiblock<unsigned long, 14ul>,
1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB              66.7572
FPR                      0.0752343
insertion_time           30.3168
successful_lookup_time   36.3754
unsuccessful_lookup_time 37.8927
mixed_lookup_time        38.045
filter                   filter<int, 1ul, fast_multiblock64<14ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB              66.7572
FPR                      0.0748914
insertion_time           21.2682
successful_lookup_time   21.2692
unsuccessful_lookup_time 21.0058
mixed_lookup_time        19.7536

As before, unsuccessful lookups get slightly worse compared to the branchful
version, as expected, since much more extra work is done.

SIMD code, branchful vs branchless, for 3M (Clang):

successful_lookup_time   6.5231
unsuccessful_lookup_time 5.71519
mixed_lookup_time        12.8607

insertion_time           6.44847
successful_lookup_time   6.68598
unsuccessful_lookup_time 6.66759
mixed_lookup_time        6.71642
...
Again, an interesting area to investigate. You're giving
me a lot of work :-)
Too hard work for me to help with :), but one speculative idea: it may be
reasonable to prepare the lookups (i.e. make_m256ix2(hash,kp);) for all
hashes unconditionally, but still keep the code branchy in the sense of
breaking as soon as the first _mm256_testc_si256 detects no match.
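
To make the idea above concrete, here is a scalar sketch of the
"prepare everything first, then branch" structure. It is not the library's
code: make_mask and next_hash are hypothetical stand-ins for
make_m256ix2 and detail::mulx64, and plain 64-bit words replace the AVX2
registers, so it only shows the control flow, not the real performance.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the per-block mask derivation; the real code
// builds AVX2 registers with make_m256ix2(hash, kp) instead.
inline std::uint64_t make_mask(std::uint64_t hash) {
  // Set up to three bits, each selected by a different 6-bit slice of the hash.
  return (std::uint64_t{1} << (hash & 63)) |
         (std::uint64_t{1} << ((hash >> 6) & 63)) |
         (std::uint64_t{1} << ((hash >> 12) & 63));
}

// Hypothetical rehash step (an assumption, not detail::mulx64).
inline std::uint64_t next_hash(std::uint64_t hash) {
  return hash * 0x9E3779B97F4A7C15ull + 1;
}

// Sets the bits for `hash` in every block.
template <std::size_t N>
void insert_precomputed(std::array<std::uint64_t, N>& blocks,
                        std::uint64_t hash) {
  for (std::size_t i = 0; i < N; ++i) {
    blocks[i] |= make_mask(hash);
    hash = next_hash(hash);
  }
}

template <std::size_t N>
bool check_precomputed(const std::array<std::uint64_t, N>& blocks,
                       std::uint64_t hash) {
  // Phase 1: build every mask unconditionally (no data-dependent branch).
  std::array<std::uint64_t, N> masks;
  for (std::size_t i = 0; i < N; ++i) {
    masks[i] = make_mask(hash);
    hash = next_hash(hash);
  }
  // Phase 2: compare, leaving at the first block that lacks a required bit.
  for (std::size_t i = 0; i < N; ++i) {
    if ((blocks[i] & masks[i]) != masks[i]) return false;
  }
  return true;
}
```

Whether this helps in the real SIMD code would have to be measured; the
sketch only separates the mask preparation from the early-exit comparison.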

Among simpler tasks: if you want, I can try to make a pull request adding
mixed results to the benchmark.
The downside is that the benchmark already has a lot of columns and rows,
so users might find it confusing.
On the other hand, I think the mixed benchmark is valuable info for people
who will not have 90%+ hits or misses in their usage.
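
As a rough illustration of what a mixed lookup workload could look like,
the sketch below interleaves keys that are present with keys that are
absent, in random order. The 50/50 ratio, the function names and the use of
std::unordered_set as a stand-in for the filter are all assumptions made
for the example, not a description of the existing benchmark.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <unordered_set>
#include <vector>

// Builds a query sequence with equal numbers of present and absent keys,
// shuffled so that hits and misses arrive in an unpredictable order.
inline std::vector<int> make_mixed_queries(const std::vector<int>& present,
                                           const std::vector<int>& absent,
                                           unsigned seed) {
  const std::size_t n = std::min(present.size(), absent.size());
  const auto count = static_cast<std::ptrdiff_t>(n);
  std::vector<int> queries;
  queries.reserve(2 * n);
  queries.insert(queries.end(), present.begin(), present.begin() + count);
  queries.insert(queries.end(), absent.begin(), absent.begin() + count);
  std::mt19937 rng(seed);  // fixed seed keeps runs reproducible
  std::shuffle(queries.begin(), queries.end(), rng);
  return queries;
}

// Counts successful lookups; std::unordered_set stands in for the filter.
inline std::size_t count_hits(const std::unordered_set<int>& container,
                              const std::vector<int>& queries) {
  std::size_t hits = 0;
  for (int q : queries) hits += container.count(q);
  return hits;
}
```

Timing count_hits over such a sequence is what exposes branch
misprediction costs that pure-hit or pure-miss runs do not show.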
...
I'd rather release the lib now in time for Boost 1.89 and then postpone
this analysis for 1.90, as it's going to take some work.
I agree; I never meant to imply this is a situation needing a critical fix,
I just found it quite interesting.

For the record, this is check now:

static BOOST_FORCEINLINE bool check(const value_type& x,boost::uint64_t hash)
{
  bool found = true;
  for(int i=0;i<k/8;++i){
    found &= check_m256ix2(x[i],hash,8);
    hash=detail::mulx64(hash);
  }
  if constexpr (k%8){
    found &= check_m256ix2(x[k/8],hash,k%8);
  }
  return found;
}
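
For readers following the short-circuit discussion, the difference between
the two operators can be seen in a small standalone sketch. The names
(probe_counter and the two helper functions) are hypothetical and unrelated
to the library: with &&, the right-hand side is skipped once the result is
already false, whereas &= on a bool always evaluates it.

```cpp
#include <cstddef>

// Hypothetical probe that records how many times it was evaluated.
struct probe_counter {
  std::size_t calls = 0;
  bool operator()(bool result) { ++calls; return result; }
};

// Short-circuiting: the probe is not called once `found` is false.
inline std::size_t calls_with_logical_and(const bool* results, std::size_t n) {
  probe_counter probe;
  bool found = true;
  for (std::size_t i = 0; i < n; ++i) {
    found = found && probe(results[i]);
  }
  return probe.calls;
}

// Bitwise AND: the probe is called on every iteration.
inline std::size_t calls_with_bitwise_and(const bool* results, std::size_t n) {
  probe_counter probe;
  bool found = true;
  for (std::size_t i = 0; i < n; ++i) {
    found &= probe(results[i]);
  }
  return probe.calls;
}
```

For the input {true, false, true, true}, the logical version evaluates the
probe twice and the bitwise version four times. An optimizer may still
transform either form when the probe has no side effects, which is why the
effect is "at least in theory".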

Ivan Matek