
El 10/06/2025 a las 14:52, Ivan Matek escribió:
One more thing I noticed/checked and I believe may be of interest, again not in particular for this proposed Knuth multiplier change, but in general.
Some filters have much weaker performance for mixed lookup. To be honest I have expected that for some(branch misprediction) but I do not know enough about all filters to say if slowdowns I have observed are as expected. Interesting that some fast_ have bad performance for mixed lookup, although naively I would assume SIMD code is branchless. I presume the issue is that sometimes we operate on more than 256 bits so there is branching...
fast_multiblock* has branches (early termination on lookup) at k = 8, 16, ..., that is, whenever a full subblock (a pair of __m128is, a __m256i, a pair of __m256is or a uint32x4x2_t, depending on the subfilter and the SIMD technology) has been processed.
What I find interesting beside this slowdown is that unordered_flat_set also has bad performance for mixed lookup so bloom filters that do not degrade on mixed lookup look much better than if we just compare 100%/0% cases. For example if we round performance to 1 decimal digit and compare unordered_flat set 4.7/3.1/*9.4* vs filter<int, 1ul, block<unsigned long, 7ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange> 3.3/3.3/*3.3*
comparison looks much much different if we drop mixed performance(bolded part).
Unlike fast_multiblock*, [multi]block is branchless, so your results make sense. I wonder if multiblock<K> should be made branchful as the working memory area can get quite large for large values of K. Joaquin M Lopez Munoz