
On Tue, Jun 17, 2025 at 3:04 PM Ivan Matek <libbooze@gmail.com> wrote:
I may have mixed up something in my computations, please feel free to correct me.
Or I can correct myself :) This was the issue, to quote the documentation: *"When the default value 0 is used, buckets have the same size as the subarrays subfilters operate on (non-overlapping case). Otherwise, bucket size is smaller and subarrays spill over adjacent buckets, which results in an improved (lower) FPR in exchange for a possibly worse performance due to memory unalignment."*

In other words, even though conceptually the block has 11 uint64_t's, they are not stored with uint64_t alignment. Still, it may be worthwhile to investigate how aggressive we want to be with the prefetching.

For example, for TestK == 3 all of these blocks prefetch 2 cache lines (this is an extreme example, K == 3 is a silly value):

filter<int, 1, multiblock<boost::uint64_t, TestK>, 1>
filter<int, 1, multiblock<boost::uint64_t, TestK>>
filter<int, 1, fast_multiblock64<TestK>>
filter<int, 1, fast_multiblock64<TestK>, 1>

For TestK == 9 these prefetch 3 cache lines:

filter<int, 1, multiblock<boost::uint64_t, TestK>, 1>
filter<int, 1, fast_multiblock64<TestK>>
filter<int, 1, fast_multiblock64<TestK>, 1>

while the only block that remains on 2 cache lines for K == 9 is:

filter<int, 1, multiblock<boost::uint64_t, TestK>>

So it all makes sense considering how the filters are specified (e.g. filter<fast_multiblock64<9>, 1> has 72 bytes of state with its start address aligned to 1, so it can stretch over 3 cache lines), assuming we want to make sure that for every random start position we load all the cache lines we may need. But the question remains whether we really want that. So, questions:

1. Should we skip the prefetch of the last cache line if it is more than 50% (or some other threshold) likely that we will not need it?
2. If we do not want to make the change from 1., should we encourage people to use bucket size 64 for fast_multiblock64?

Some performance numbers below, after a quick sanity check of the cache-line arithmetic.
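To make question 1 concrete before the numbers, here is a rough back-of-the-envelope sketch. This is my own standalone code, not anything from Boost.Bloom; the block sizes (24/72/88 bytes for K == 3/9/11) and the assumption that start offsets within a cache line are uniformly distributed over the multiples of the bucket size are mine. It counts how many cache lines a block can span in the worst case and how often that last line is actually touched:

// Rough sketch, not Boost.Bloom code: for a block of block_size bytes whose
// start address is only guaranteed to be aligned to `alignment` bytes
// (my stand-in for the bucket size), count how many 64-byte cache lines the
// block can touch in the worst case, and what fraction of the possible start
// offsets within a cache line actually touch that last line. Assumes start
// offsets are uniformly distributed over the multiples of `alignment`.
#include <cstddef>
#include <cstdio>

constexpr std::size_t cache_line = 64;

struct span_stats
{
    std::size_t worst_lines;      // cache lines prefetched today (worst case)
    double      last_line_ratio;  // fraction of start offsets touching the last line
};

span_stats analyze(std::size_t block_size, std::size_t alignment)
{
    std::size_t worst = 0;
    for(std::size_t off = 0; off < cache_line; off += alignment) {
        std::size_t lines = (off + block_size - 1) / cache_line + 1;
        if(lines > worst) worst = lines;
    }
    std::size_t total = 0, need_last = 0;
    for(std::size_t off = 0; off < cache_line; off += alignment) {
        ++total;
        if((off + block_size - 1) / cache_line + 1 == worst) ++need_last;
    }
    return { worst, double(need_last) / double(total) };
}

int main()
{
    // {block size in bytes, alignment}: my reading of the state sizes of
    // (fast_)multiblock for K = 3, 9, 11, with bucket size 1 and 64.
    const std::size_t cases[][2] = { {24, 1}, {72, 1}, {88, 1}, {88, 64} };
    for(const auto& c : cases) {
        span_stats s = analyze(c[0], c[1]);
        std::printf("block %2zu bytes, alignment %2zu: worst case %zu cache lines, "
                    "last line needed for %.0f%% of start offsets\n",
                    c[0], c[1], s.worst_lines, 100.0 * s.last_line_ratio);
    }
}

If I got the arithmetic right, the 72-byte K == 9 block with bucket size 1 touches a third cache line for only 7 of the 64 possible start offsets (about 11%), and the 88-byte K == 11 block for 23 of 64 (about 36%), i.e. both are below the 50% threshold from question 1.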
The performance difference w.r.t. prefetch can be large for unsuccessful lookups. Here we manually shrink the computed number of prefetched cache lines by 1 (in this example we reduced 3 cache lines to 2):

prefetch 2 cache lines
filter  filter<int, 1ul, fast_multiblock64<9ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.00679
FPR  0.154767
insertion_time  6.14765
successful_lookup_time  5.01657
*unsuccessful_lookup_time  3.59357*
mixed_lookup_time  11.4932

prefetch 3 cache lines (current prefetch setup)
filter  filter<int, 1ul, fast_multiblock64<9ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.00679
FPR  0.154767
insertion_time  6.10838
successful_lookup_time  5.68114
*unsuccessful_lookup_time  5.50207*
mixed_lookup_time  11.9192

prefetch 2 cache lines
filter  filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.00679
FPR  0.159833
insertion_time  6.01277
successful_lookup_time  4.97918
*unsuccessful_lookup_time  3.6566*
mixed_lookup_time  11.5088

prefetch 3 cache lines (current prefetch setup)
filter  filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.00679
FPR  0.159833
insertion_time  6.90663
successful_lookup_time  6.14258
*unsuccessful_lookup_time  5.52233*
mixed_lookup_time  11.9335

A large bucket size hurts FPR but makes blocks aligned, so we prefetch fewer cache lines. All of the following data has prefetching enabled and uses the default formula, so no manual adjustment (but bucket size 64 causes alignment, so we only fetch 2 cache lines):

filter  filter<int, 1ul, fast_multiblock64<11ul>, *0ul*, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.00695
*FPR  0.187567*
insertion_time  6.00145
successful_lookup_time  5.66594
unsuccessful_lookup_time  5.54643
mixed_lookup_time  12.294

filter  filter<int, 1ul, fast_multiblock64<11ul>, *1ul*, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.00679
*FPR  0.159833*
insertion_time  6.20854
successful_lookup_time  5.74137
unsuccessful_lookup_time  5.6112
mixed_lookup_time  12.1253

filter  filter<int, 1ul, fast_multiblock64<11ul>, *64ul*, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.00704
*FPR  0.2164*
insertion_time  4.13648
successful_lookup_time  4.35427
unsuccessful_lookup_time  3.49425
mixed_lookup_time  11.3594

Huge wins for the best-aligned filter, except in the mixed lookup, where the fact that we are branchful hurts us in all cases.

The speedup is not limited to the AVX2 code; here we see plain multiblock (I am not sure why the FPR did not suffer, I assume noise or a hashing artifact):

filter  filter<int, 1ul, multiblock<unsigned long, 9ul>, *1ul*, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.72205
*FPR  0.068*
insertion_time  7.27091
successful_lookup_time  6.08024
unsuccessful_lookup_time  5.82294
mixed_lookup_time  5.80241

filter  filter<int, 1ul, multiblock<unsigned long, 9ul>, *8ul*, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB  5.72207
*FPR  0.0672*
insertion_time  5.09056
successful_lookup_time  4.70166
unsuccessful_lookup_time  4.51264
mixed_lookup_time  4.48538

To be clear, it is not just prefetching that is affected by alignment; for example, the code tag-dispatches in the get function based on whether the data is aligned.

Regards,
Ivan