
El 12/06/2025 a las 13:15, Ivan Matek escribió:
On Wed, Jun 11, 2025 at 6:33 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
Unlike fast_multiblock*, [multi]block is branchless, so your results make sense. I wonder if multiblock<K> should be made branchful as the working memory area can get quite large for large values of K.
Can you elaborate why would you want to change multiblock? To save memory accesses that are unnecessary and slow down rest of the program because they occupy cache or purely to reduce number of instructions executed?
Well, both. Consider the following cost model: * Succesful lookup executes in A*K cycles for some constant factor A. * Branchless unsuccesful lookup is also A*K, exactly the same as succesful lookup. * Branchful unsuccesful lookup when the array is half full executes in B*(1*0.5 + 2*0.25 + 3*0.125 + ... + K*2^-K-1) cycles . Since the sum decays exponentially, this is roughly equivalent to some constant C irrespective of K. Now, the thing is that for a sufficiently large K, A*K (branchless) is going to be larger than C (branchful). The best approach would be then to have a hybrid algorithm that branches every L operations. Determining such a value is an interesting experiment.
If that is the case on my CPU it does not seem the matter a lot, i.e. changing AVX2 to be branchless does not help or hurt performance, maybe CPU anyway decides to fetch data...
This is what I did to make avx code branchless, did not affect performance in a way I could detect: [...] found = found && check_m256ix2(x[i],hash,8);
This is short circuiting (at least in theory). You may try using bitwise AND (&). Also, try with large values of K: you should reach a point where branchless is worse.
But back to the multiblock and making it branchful: With my CPU/compiler(clang is 🚀 for this benchmark compared to gcc) it outperforms fast_multiblock for L2, L3, RAM size filters when it comes to mixed lookups, so I am not certain it needs changing.
Yes, it's hard to decide based on these numbers. For branchless lookup, mixed lookup obviously takes the same as successful or unsuccesful lookup. For branchful lookup, a cost model for mixed lookup would be P*M*(A*K) + (1-P)*M*C where P is the percentange of successful lookups and M is an extra cost factor on top of the non-mixed approach due to branch mispredictions. Again, an interesting area to investigate. You're giving me a lot of work :-) I'd rather release the lib now in time for Boost 1.89 and then postpone this analysis for 1.90, as it's going to take some work. Joaquin M Lopez Munoz