
On 14/06/2025 at 19:32, Ivan Matek wrote:
One more datapoint that confirms what you suggested (and what intuitively makes sense): for large K, the effect of breaking early on unsuccessful lookups is more impressive (when the filter fits in L2).
[...]
There is also another interesting fact about performance that will take a bit to explain, so I ask for patience: fast_multiblock<8> is much faster than fast_block<11> for unsuccessful_lookup_time. This does not initially make sense, because we would expect to execute the same number of instructions, and the branch is easily predictable since 100% of lookups fail (besides FPR). I am not an asm expert, but it seems clang interleaves the SIMD code in the 11 case, so some work that is not strictly necessary before the branch gets done before the branch. This is great when we do not branch (early return), but is wasted work when close to 100% of lookups are unsuccessful.
Yes, this is odd. If your thesis is correct, the compiler's budget optimizer seems to have decided to go branchless despite what the code says.
This made me investigate a bit, and on my machine I can get really great performance for unsuccessful_lookup_time when I combine a SIMD filter and a scalar filter for the "remainder" (e.g. SIMD 8, scalar 3 for K==11).
This is an idea that deserves some investigation. When K%8 is small, going scalar could actually be faster than SIMD. This is what your experiments seem to suggest anyway.
This combined filter (bobw_filter) has really great performance for unsuccessful lookups[...]
[...]
I understand that mixing two filters, instead of modifying one to have a mix of SIMD and scalar code, is not really the greatest comparison, but it was much easier for me to make a combined filter than to figure out how to mix scalar and SIMD filter code in one filter.
If you want to keep on playing, please find attached a sketch of a bobw_multiblock64_avx2 filter for your enjoyment. (Why the "bobw" prefix, btw?)

Joaquin M Lopez Munoz