
On Tue, Jun 17, 2025 at 3:04 PM Ivan Matek <libbooze@gmail.com> wrote:
I may have mixed up something in my computations, please feel free to correct me.
Or I can correct myself :) This was the issue, to quote the documentation: *"When the default value 0 is used, buckets have the same size as the subarrays subfilters operate on (non-overlapping case). Otherwise, bucket size is smaller and subarrays spill over adjacent buckets, which results in an improved (lower) FPR in exchange for a possibly worse performance due to memory unalignment."*

In other words, even though conceptually the block has 11 uint64_t's, they are not stored with uint64_t alignment. Still, it may be worthwhile to investigate how aggressive we want to be with the prefetching.

For example, for TestK == 3 all of these blocks prefetch 2 cache lines (this is an extreme example, K == 3 is a silly value):

filter<int, 1, multiblock<boost::uint64_t, TestK>, 1>
filter<int, 1, multiblock<boost::uint64_t, TestK>>
filter<int, 1, fast_multiblock64<TestK>>
filter<int, 1, fast_multiblock64<TestK>, 1>

For TestK == 9 these prefetch 3 cache lines:

filter<int, 1, multiblock<boost::uint64_t, TestK>, 1>
filter<int, 1, fast_multiblock64<TestK>>
filter<int, 1, fast_multiblock64<TestK>, 1>

while the only block that remains on 2 cache lines for K == 9 is:

filter<int, 1, multiblock<boost::uint64_t, TestK>>

So it all makes sense considering how the filters are specified (e.g. filter<fast_multiblock64<9>, 1> has 72 bytes of state with its start address aligned to 1, so it can stretch over 3 cache lines), assuming we want to make sure that for every random start position we load all the cache lines we may need. But the question remains whether we really want that. So, questions:

1. Should we skip the prefetch of the last cache line if it is more than 50% (or some other threshold) likely that we will not need it?
2. If we do not want to make the change from 1., should we encourage people to use bucket size 64 for fast_multiblock64?

Some performance numbers below, after a quick sanity check of the cache-line arithmetic.
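To make question 1 concrete before the numbers, here is a rough back-of-the-envelope sketch. This is my own standalone code, not anything from Boost.Bloom; the block sizes (24/72/88 bytes for K == 3/9/11) and the assumption that start offsets within a cache line are uniformly distributed over the multiples of the bucket size are mine. It counts how many cache lines a block can span in the worst case and how often that last line is actually touched:

// Rough sketch, not Boost.Bloom code: for a block of block_size bytes whose
// start address is only guaranteed to be aligned to `alignment` bytes
// (my stand-in for the bucket size), count how many 64-byte cache lines the
// block can touch in the worst case, and what fraction of the possible start
// offsets within a cache line actually touch that last line. Assumes start
// offsets are uniformly distributed over the multiples of `alignment`.
#include <cstddef>
#include <cstdio>

constexpr std::size_t cache_line = 64;

struct span_stats
{
    std::size_t worst_lines;      // cache lines prefetched today (worst case)
    double      last_line_ratio;  // fraction of start offsets touching the last line
};

span_stats analyze(std::size_t block_size, std::size_t alignment)
{
    std::size_t worst = 0;
    for(std::size_t off = 0; off < cache_line; off += alignment) {
        std::size_t lines = (off + block_size - 1) / cache_line + 1;
        if(lines > worst) worst = lines;
    }
    std::size_t total = 0, need_last = 0;
    for(std::size_t off = 0; off < cache_line; off += alignment) {
        ++total;
        if((off + block_size - 1) / cache_line + 1 == worst) ++need_last;
    }
    return { worst, double(need_last) / double(total) };
}

int main()
{
    // {block size in bytes, alignment}: my reading of the state sizes of
    // (fast_)multiblock for K = 3, 9, 11, with bucket size 1 and 64.
    const std::size_t cases[][2] = { {24, 1}, {72, 1}, {88, 1}, {88, 64} };
    for(const auto& c : cases) {
        span_stats s = analyze(c[0], c[1]);
        std::printf("block %2zu bytes, alignment %2zu: worst case %zu cache lines, "
                    "last line needed for %.0f%% of start offsets\n",
                    c[0], c[1], s.worst_lines, 100.0 * s.last_line_ratio);
    }
}

If I got the arithmetic right, the 72-byte K == 9 block with bucket size 1 touches a third cache line for only 7 of the 64 possible start offsets (about 11%), and the 88-byte K == 11 block for 23 of 64 (about 36%), i.e. both are below the 50% threshold from question 1.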
The performance difference w.r.t. prefetch can be large for unsuccessful lookups. Here we manually shrink the computed number of prefetched cache lines by 1 (in this example we reduced 3 cache lines to 2):

prefetch 2 cache lines
filter  filter<int, 1ul, fast_multiblock64<9ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.00679
FPR  0.154767
insertion_time  6.14765
successful_lookup_time  5.01657
*unsuccessful_lookup_time  3.59357*
mixed_lookup_time  11.4932

prefetch 3 cache lines (current prefetch setup)
filter  filter<int, 1ul, fast_multiblock64<9ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.00679
FPR  0.154767
insertion_time  6.10838
successful_lookup_time  5.68114
*unsuccessful_lookup_time  5.50207*
mixed_lookup_time  11.9192

prefetch 2 cache lines
filter  filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.00679
FPR  0.159833
insertion_time  6.01277
successful_lookup_time  4.97918
*unsuccessful_lookup_time  3.6566*
mixed_lookup_time  11.5088

prefetch 3 cache lines (current prefetch setup)
filter  filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.00679
FPR  0.159833
insertion_time  6.90663
successful_lookup_time  6.14258
*unsuccessful_lookup_time  5.52233*
mixed_lookup_time  11.9335

A large bucket size hurts FPR but makes blocks aligned, so we prefetch fewer cache lines. All of the following data has prefetching enabled and uses the default formula, so no manual adjustment (but bucket size 64 causes alignment, so we only fetch 2 cache lines):

filter  filter<int, 1ul, fast_multiblock64<11ul>, *0ul*, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.00695
*FPR  0.187567*
insertion_time  6.00145
successful_lookup_time  5.66594
unsuccessful_lookup_time  5.54643
mixed_lookup_time  12.294

filter  filter<int, 1ul, fast_multiblock64<11ul>, *1ul*, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.00679
*FPR  0.159833*
insertion_time  6.20854
successful_lookup_time  5.74137
unsuccessful_lookup_time  5.6112
mixed_lookup_time  12.1253

filter  filter<int, 1ul, fast_multiblock64<11ul>, *64ul*, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.00704
*FPR  0.2164*
insertion_time  4.13648
successful_lookup_time  4.35427
unsuccessful_lookup_time  3.49425
mixed_lookup_time  11.3594

Huge wins for the best-aligned filter, except in the mixed lookup, where the fact that we are branchful hurts us in all cases.

The speedup is not limited to the AVX2 code; here we see plain multiblock (I am not sure why the FPR did not suffer, I assume noise or a hashing artifact):

filter  filter<int, 1ul, multiblock<unsigned long, 9ul>, *1ul*, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.72205
*FPR  0.068*
insertion_time  7.27091
successful_lookup_time  6.08024
unsuccessful_lookup_time  5.82294
mixed_lookup_time  5.80241

filter  filter<int, 1ul, multiblock<unsigned long, 9ul>, *8ul*, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.72207
*FPR  0.0672*
insertion_time  5.09056
successful_lookup_time  4.70166
unsuccessful_lookup_time  4.51264
mixed_lookup_time  4.48538

To be clear, it is not just prefetching that is affected by alignment; for example, the code tag-dispatches in the get function based on whether the data is aligned.

Regards,
Ivan