
One more datapoint that confirms what you suggested (and what intuitively makes sense): for large K the effect of breaking early on unsuccessful lookups looks more impressive (when the filter fits in L2). Here are the K == 19 results. I picked K == 19 only because it is bigger than 16, no other reason; it may be a silly number to use for a real bloom filter. Clang results only.

*1M*

branchless:
filter filter<int, 1ul, fast_multiblock64<19ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 2.38419
FPR 0.0174
insertion_time 6.89076
successful_lookup_time 7.72433
unsuccessful_lookup_time 7.76004
mixed_lookup_time 7.78691

branchful:
filter filter<int, 1ul, fast_multiblock64<19ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 2.38419
FPR 0.0174
insertion_time 6.808
successful_lookup_time 7.09307
unsuccessful_lookup_time 4.9642
mixed_lookup_time 12.8065

For L3 sizes the branchless code performs relatively much better, even though the branchy code still wins for unsuccessful lookups; I would assume this is because of the larger memory latencies.

*3M*

branchless:
filter filter<int, 1ul, fast_multiblock64<19ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 7.15256
FPR 0.0153667
insertion_time 10.1734
successful_lookup_time 9.1581
unsuccessful_lookup_time 9.0949
mixed_lookup_time 9.10151

branchful:
filter filter<int, 1ul, fast_multiblock64<19ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 7.15256
FPR 0.0153667
insertion_time 8.9318
successful_lookup_time 8.79925
unsuccessful_lookup_time 8.34029
mixed_lookup_time 14.4601

There is also another interesting performance fact that will take a bit to explain, so I ask for patience: fast_multiblock64<8> is much faster than fast_multiblock64<11> for unsuccessful_lookup_time. This does not initially make sense, because we would expect to execute the same amount of instructions, and the branch is easily predictable since 100% of lookups fail (FPR aside). I am not an asm expert, but it seems clang interleaves the SIMD work for the 11 case, so some work that is not strictly necessary before the branch is done before the branch. That is great when we do not take the early-return branch, but it is wasted work when close to 100% of lookups are unsuccessful. This made me investigate a bit, and on my machine I can get really great unsuccessful_lookup_time when I combine a SIMD filter with a scalar filter for the "remainder" (e.g. SIMD 8, scalar 3 for K == 11). This combined filter (bobw_filter) has really great performance for unsuccessful lookups; it is unclear whether this is just a microbenchmark anomaly (if there was some extra code, as in a real application, maybe the extra SIMD code executed for fast_multiblock64<11ul> would not matter).
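To make concrete what I mean by branchless vs branchful throughout this mail, here is a rough scalar sketch of the two lookup shapes; the names and the per-block fingerprint model are my own simplification, not the actual fast_multiblock64 code:

#include <cstddef>
#include <cstdint>

// Simplified model: a lookup checks n blocks, and the element "may be present"
// only if every block contains all the bits of its fingerprint word.

// Branchless: always inspect every block, accumulate the result with &=.
bool may_contain_branchless(const std::uint64_t* blocks,
                            const std::uint64_t* fps, std::size_t n)
{
    bool found = true;
    for(std::size_t i = 0; i < n; ++i) {
        found &= ((blocks[i] & fps[i]) == fps[i]); // no data-dependent branch
    }
    return found;
}

// Branchful: return as soon as one block rules the element out. Cheap when
// almost all lookups are unsuccessful and the filter is cache resident.
bool may_contain_branchful(const std::uint64_t* blocks,
                           const std::uint64_t* fps, std::size_t n)
{
    for(std::size_t i = 0; i < n; ++i) {
        if((blocks[i] & fps[i]) != fps[i]) return false; // early exit
    }
    return true;
}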
*3M*

filter bobw_filter<11ul> // fast_multiblock64<8> + multiblock<3>
capacity MB 5.72205
FPR 0.0555667
insertion_time 7.7845
successful_lookup_time 6.48931
unsuccessful_lookup_time 3.45824
mixed_lookup_time 12.2271

filter filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 5.72205
FPR 0.0609667
insertion_time 6.6754
successful_lookup_time 6.161
unsuccessful_lookup_time 6.19519
mixed_lookup_time 6.09886

filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange> // branchful
capacity MB 5.72205
FPR 0.0632
insertion_time 6.33386
successful_lookup_time 5.87341
unsuccessful_lookup_time 5.86074
mixed_lookup_time 12.0457

I would not bother you with a small difference, but 3.46 vs 6.20 and 5.86 looks quite impressive, with the disclaimer mentioned before that it may just be a tight-microbenchmark difference, and this is only for cases where people are purely focused on unsuccessful lookups. The advantage holds even for a large number of elements for unsuccessful lookups, but the other operations look quite bad relatively:

*35M*

filter bobw_filter<11ul>
capacity MB 66.7572
FPR 0.0569229
insertion_time 23.9595
successful_lookup_time 22.1188
unsuccessful_lookup_time 10.6815
mixed_lookup_time 24.646

filter filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 66.7572
FPR 0.0635571
insertion_time 19.1072
successful_lookup_time 18.0308
unsuccessful_lookup_time 18.0789
mixed_lookup_time 18.0465

filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 66.7572
FPR 0.06368
insertion_time 18.6431
successful_lookup_time 16.2872
unsuccessful_lookup_time 14.8029
mixed_lookup_time 21.1569

I understand that combining two filters, instead of modifying one filter to contain a mix of SIMD and scalar code, is not really the greatest comparison, but it was much easier for me to make a combined filter than to figure out how to mix the scalar and SIMD filter code in one filter.

template<std::size_t K>
struct bobw_filter {
    static constexpr bool has_scalar = K % 4;

    bobw_filter() = default;

    // Split the requested capacity proportionally: the scalar filter handles
    // K%4 hash functions, the SIMD filter handles K/4*4.
    static constexpr std::size_t scalar_capacity(std::size_t full_capacity) {
        return full_capacity * (K % 4) / K;
    }
    static constexpr std::size_t simd_capacity(std::size_t full_capacity) {
        return full_capacity * (K / 4 * 4) / K;
    }

    explicit bobw_filter(std::size_t capacity) : simd_f(simd_capacity(capacity)) {
        if constexpr(has_scalar) {
            assert(scalar_capacity(capacity) != 0);
            scalar_f.reset(scalar_capacity(capacity));
        }
    }

    // TODO: is this mulx64 the correct thing to do, or should we apply it to the hash of x?
    BOOST_FORCEINLINE void insert(const int& x) {
        simd_f.insert(x);
        if constexpr(has_scalar) {
            scalar_f.insert(detail::mulx64(x));
        }
    }

    BOOST_FORCEINLINE bool may_contain(const int& x) {
        if constexpr(has_scalar) {
            return simd_f.may_contain(x) && scalar_f.may_contain(detail::mulx64(x));
        } else {
            return simd_f.may_contain(x);
        }
    }

    auto capacity() {
        if constexpr(has_scalar) {
            return scalar_f.capacity() + simd_f.capacity();
        } else {
            return simd_f.capacity();
        }
    }

    [[no_unique_address]] std::conditional_t<
        has_scalar,
        filter<int, 1, multiblock<boost::uint64_t, K % 4>, 1>,
        std::monostate> scalar_f;
    filter<int, 1, fast_multiblock64<K / 4 * 4>, 1> simd_f;

    using value_type = typename decltype(simd_f)::value_type;
};
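And a minimal usage sketch, assuming the bobw_filter definition above compiles in your translation unit with the same Boost.Bloom includes and using-declarations as my benchmark code; the 48'000'000-bit capacity is just an example matching the ~5.72 MB of the 3M runs above:

#include <cassert>

int main()
{
    // K == 11: ~8/11 of the capacity goes to the SIMD filter, ~3/11 to the scalar one.
    bobw_filter<11> f(48'000'000);
    f.insert(42);
    assert(f.may_contain(42));          // no false negatives
    bool maybe = f.may_contain(12345);  // usually false; true only on a false positive
    (void)maybe;
}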