
On Wed, Jun 11, 2025 at 6:33 PM Joaquin M López Muñoz via Boost < boost@lists.boost.org> wrote:
Unlike fast_multiblock*, [multi]block is branchless, so your results make sense. I wonder if multiblock<K> should be made branchful as the working memory area can get quite large for large values of K.
Can you elaborate on why you would want to change multiblock?
To save memory accesses that are unnecessary and slow down the rest of the program because they occupy cache, or purely to reduce the number of instructions executed? If it is the latter, on my CPU it does not seem to matter much: changing AVX2 to be branchless neither helps nor hurts performance; maybe the CPU decides to fetch the data anyway.

This is what I did to make the AVX code branchless; it did not affect performance in any way I could detect:

    static BOOST_FORCEINLINE bool check(const value_type& x, boost::uint64_t hash)
    {
        bool found = true;
        for (int i = 0; i < k/8; ++i) {
            found = found && check_m256ix2(x[i], hash, 8);
            hash = detail::mulx64(hash);
        }
        if constexpr (k % 8) {
            found = found && check_m256ix2(x[k/8], hash, k % 8);
        }
        return found;
    }

But back to multiblock and making it branchful: with my CPU/compiler (clang is 🚀 for this benchmark compared to gcc) it outperforms fast_multiblock for L2-, L3- and RAM-sized filters when it comes to mixed lookups, so I am not certain it needs changing. For unsuccessful lookups fast_multiblock wins a little, but I am not sure that unsuccessful lookups will be close to 100% in all use cases. Maybe the best option would be to let users configure whether a filter is branchy or not, but with so many filters already, I am not sure another config option is wise, as it may confuse users.

In any case, here are numbers for two similar filters on my machine, at 1M, 3M and 35M values (I have a lot of L3 cache, so I needed a large number of values to go over L3).
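For clarity, this is roughly the branchful-vs-branchless distinction being discussed, as a self-contained scalar sketch (check_block, the probing scheme and the mixing constant are all illustrative stand-ins, not the library's actual code):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative stand-in for the per-block bit test: true iff all probed
// bits derived from `hash` are set in `block`.
static bool check_block(std::uint64_t block, std::uint64_t hash) {
    std::uint64_t mask = (1ull << (hash & 63)) | (1ull << ((hash >> 6) & 63));
    return (block & mask) == mask;
}

// Branchful variant: early exit on the first failing block, as in
// fast_multiblock*. Cheap when most lookups are unsuccessful.
template <std::size_t K>
bool check_branchful(const std::uint64_t (&x)[K], std::uint64_t hash) {
    for (std::size_t i = 0; i < K; ++i) {
        if (!check_block(x[i], hash)) return false; // early exit
        hash *= 0x9E3779B97F4A7C15ull;              // illustrative re-mixing
    }
    return true;
}

// Branchless variant: always touch all K blocks and AND the results,
// as in [multi]block. Predictable control flow, more memory accesses.
template <std::size_t K>
bool check_branchless(const std::uint64_t (&x)[K], std::uint64_t hash) {
    bool found = true;
    for (std::size_t i = 0; i < K; ++i) {
        found = found & check_block(x[i], hash);    // no short-circuit
        hash *= 0x9E3779B97F4A7C15ull;
    }
    return found;
}
```

Both return the same answer; the trade-off is purely between the early exit (which the branch predictor must guess) and the unconditional accesses to all K blocks.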
Both filters below are filter<int, 1ul, B, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>, with B = multiblock<unsigned long, 11ul> or fast_multiblock64<11ul>.

*CLANG*

*1M/L2*
                             multiblock    fast_multiblock64
    capacity MB              1.90735       1.90735
    FPR                      0.0581        0.0629
    insertion_time           4.30731       4.33433
    successful_lookup_time   4.62453       4.47549
    unsuccessful_lookup_time 4.64252       3.46791
    mixed_lookup_time        4.6812        11.0634

*3M/L3*
                             multiblock    fast_multiblock64
    capacity MB              5.72205       5.72205
    FPR                      0.0609667     0.0632
    insertion_time           7.01126       7.17907
    successful_lookup_time   6.40535       6.13649
    unsuccessful_lookup_time 6.30024       6.17347
    mixed_lookup_time        6.20703       12.0361

*35M/RAM*
                             multiblock    fast_multiblock64
    capacity MB              66.7572       66.7572
    FPR                      0.0635571     0.06368
    insertion_time           19.3996       18.9966
    successful_lookup_time   18.6527       18.729
    unsuccessful_lookup_time 18.1346       16.1566
    mixed_lookup_time        18.2745       21.5246

In case somebody is curious, GCC 14 numbers:

*GCC*

*1M*
                             multiblock    fast_multiblock64
    capacity MB              1.90735       1.90735
    FPR                      0.0581        0.0629
    insertion_time           13.0513       5.68284
    successful_lookup_time   8.06424       5.7248
    unsuccessful_lookup_time 8.09518       4.32124
    mixed_lookup_time        8.21242       11.7601

*3M*
                             multiblock    fast_multiblock64
    capacity MB              5.72205       5.72205
    FPR                      0.0609667     0.0632
    insertion_time           14.0387       6.97644
    successful_lookup_time   9.44954       6.69201
    unsuccessful_lookup_time 9.3228        5.9029
    mixed_lookup_time        9.41488       12.6044

*35M* (in some cases over 50% slower than clang)
                             multiblock    fast_multiblock64
    capacity MB              66.7572       66.7572
    FPR                      0.0635571     0.06368
    insertion_time           26.4349       20.6438
    successful_lookup_time   32.8604       18.1056
    unsuccessful_lookup_time 33.0485       16.2122
    mixed_lookup_time        33.0334       23.2072

What I find quite interesting is that if you benchmark only with clang or only with gcc, you reach totally different conclusions about which filter is faster. Not to mention that before measuring I would have assumed that in the 35M case memory latency dominates, so differences in optimization would not matter. For example, 35M mixed_lookup_time (multiblock vs fast_multiblock64):

    clang: 18.2745 vs 21.5246
    gcc:   33.0334 vs 23.2072

P.S. I am obviously comparing clang and gcc on just one benchmark; please do not extrapolate this to general optimizer performance.
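For reference, a mixed workload in the sense used above is just a shuffled 50/50 split of inserted and absent keys, which makes the early-exit branch of a branchful filter unpredictable. A toy sketch of such a workload against a minimal single-probe-loop Bloom filter (TinyBloom, make_mixed and every constant here are illustrative and have nothing to do with the actual Boost.Bloom API):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Minimal Bloom filter, just enough to drive a mixed-lookup workload.
struct TinyBloom {
    std::vector<std::uint64_t> bits;
    explicit TinyBloom(std::size_t words) : bits(words, 0) {}

    static std::uint64_t mix(std::uint64_t h) {   // cheap 64-bit mixing
        h ^= h >> 33; h *= 0xFF51AFD7ED558CCDull; h ^= h >> 33;
        return h;
    }
    void insert(std::uint64_t key) {
        std::uint64_t h = mix(key);
        for (int k = 0; k < 4; ++k) {             // 4 probes per key
            bits[(h >> 8) % bits.size()] |= 1ull << (h & 63);
            h = mix(h);
        }
    }
    bool check(std::uint64_t key) const {
        std::uint64_t h = mix(key);
        bool found = true;
        for (int k = 0; k < 4; ++k) {             // branchless accumulation
            found = found & ((bits[(h >> 8) % bits.size()] >> (h & 63)) & 1u);
            h = mix(h);
        }
        return found;
    }
};

// Shuffled 50/50 mix: n keys that were inserted (0..n-1) and n that were
// not (1000000..1000000+n-1), in unpredictable order.
static std::vector<std::uint64_t> make_mixed(std::size_t n) {
    std::vector<std::uint64_t> q;
    for (std::uint64_t i = 0; i < n; ++i) q.push_back(i);
    for (std::uint64_t i = 0; i < n; ++i) q.push_back(1000000 + i);
    std::shuffle(q.begin(), q.end(), std::mt19937_64(42));
    return q;
}
```

With such a stream, an early-exit lookup succeeds or fails at essentially random positions, which is presumably why mixed_lookup_time blows up for fast_multiblock64 while the branchless multiblock barely moves.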