
On Thu, Jun 12, 2025 at 8:56 PM Joaquin M López Muñoz via Boost < boost@lists.boost.org> wrote:
This is short circuiting (at least in theory). You may try using bitwise AND (&). Also,
Doh. :) Thank you for catching that. On my machine when I do that difference is miniscule/noise. I have bumped up now K to 14, so numbers are not directly comparable to previous numbers, but most important thing is that there is no difference(for Clang). Clang *1M* filter filter<int, 1ul, multiblock<unsigned long, 14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange> capacity MB 1.90735 FPR 0.0737 insertion_time 4.97109 successful_lookup_time 5.75406 unsuccessful_lookup_time 5.73877 mixed_lookup_time 5.77369 filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange> capacity MB 1.90735 FPR 0.0736 insertion_time 5.19262 successful_lookup_time 5.8273 unsuccessful_lookup_time 5.81782 mixed_lookup_time 5.81926 *35M* filter filter<int, 1ul, multiblock<unsigned long, 14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange> capacity MB 66.7572 FPR 0.0752343 insertion_time 19.7033 successful_lookup_time 19.8497 unsuccessful_lookup_time 20.9164 mixed_lookup_time 20.6151 filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange> capacity MB 66.7572 FPR 0.0748914 insertion_time 19.734 successful_lookup_time 20.3514 unsuccessful_lookup_time 19.5756 mixed_lookup_time 19.375 as before GCC performance for multiblock is bad, so now SIMD code easily wins GCC *1M*filter filter<int, 1ul, multiblock<unsigned long, 14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange> capacity MB 1.90735 FPR 0.0737 insertion_time 11.846 successful_lookup_time 9.76772 unsuccessful_lookup_time 9.80503 mixed_lookup_time 9.81788 filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange> capacity MB 1.90735 FPR 0.0736 insertion_time 6.17691 successful_lookup_time 6.43346 unsuccessful_lookup_time 6.45574 mixed_lookup_time 6.4289 *35M* filter filter<int, 1ul, multiblock<unsigned long, 14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange> capacity MB 66.7572 FPR 0.0752343 insertion_time 30.3168 successful_lookup_time 36.3754 unsuccessful_lookup_time 37.8927 mixed_lookup_time 38.045 filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange> capacity MB 66.7572 FPR 0.0748914 insertion_time 21.2682 successful_lookup_time 21.2692 unsuccessful_lookup_time 21.0058 mixed_lookup_time 19.7536 As before unsuccessful lookups get worse slightly compared to branchful version, as expected, since much more extra work is done. SIMD code branchful vs branchless for 3M(Clang): successful_lookup_time 6.5231 unsuccessful_lookup_time 5.71519 mixed_lookup_time 12.8607 insertion_time 6.44847 successful_lookup_time 6.68598 unsuccessful_lookup_time 6.66759 mixed_lookup_time 6.71642
Again, an interesting area to investigate. You're giving me a lot of work :-)
Too hard work for me to help with :), but one speculative idea: it may be reasonable to prepare lookups(i.e. make_m256ix2(hash,kp); ) for all hashes unconditionally, but still have branchy code in terms of breaking when first _mm256_testc_si256 detects no match. Of simpler tasks: if you want I can try to make pull request for adding mixed results to benchmark. Downside is that benchmark already has a lot of columns and rows so users might find it confusing. On the other hand I think mixed benchmark is valuable info for people who will not have 90% + of hits or misses in their usages.
I'd rather release the lib now in time for Boost 1.89 and then postpone this analysis for 1.90, as it's going to take some work.
I agree, I never meant to imply this is critical fix needed situation, I just found it quite interesting. for the record this is check now: static BOOST_FORCEINLINE bool check(const value_type& x,boost::uint64_t hash) { bool found = true; for(int i=0;i<k/8;++i){ found &= check_m256ix2(x[i],hash,8); hash=detail::mulx64(hash); } if constexpr (k%8){ found &= check_m256ix2(x[k/8],hash,k%8); } return found; }