[bloom] Benchmarks with Knuth multiplier-based hash production

Hi,

As per feedback given here: https://lists.boost.org/Archives/boost//2025/05/259548.php I've experimented with the hash production scheme suggested by Kostas, where h_{i+1} is calculated as h_i * 6364136223846793005 mod 2^64: https://github.com/joaquintides/bloom/blob/refs/heads/feature/alternative-ha...

These benchmarks show the performance of each filter configuration with the original mechanism and, on the next row, the new mechanism (filter names marked with *): https://github.com/joaquintides/boost_bloom_benchmarks/blob/alternative-hash...

I can't draw any definitive conclusion as to whether the new mechanism affects performance, probably because the GHA-run benchmarks have a good deal of noise and variations in CPU speed. Maybe a keener eye than mine can tell, or if you're interested in trying the benchmark locally please tell me so and I can instruct you on how to do it.

Feedback welcome,

Joaquin M Lopez Munoz
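For reference, a minimal sketch of the hash production scheme being benchmarked (my own illustration, not the actual Boost.Bloom code; forcing the state odd is an assumption, mirroring the or $0x1 visible in the asm dump later in this thread):

#include <array>
#include <cstddef>
#include <cstdint>

// h_{i+1} = h_i * 6364136223846793005 mod 2^64; unsigned 64-bit wraparound
// provides the mod 2^64 for free.
template<std::size_t K>
std::array<std::uint64_t, K> produce_hashes(std::uint64_t h0)
{
  std::array<std::uint64_t, K> hs;
  std::uint64_t h = h0 | 1; // assumption: keep the state odd so a purely multiplicative sequence cannot collapse to 0
  for(std::size_t i = 0; i < K; ++i) {
    hs[i] = h;
    h *= 6364136223846793005ull;
  }
  return hs;
}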

On Fri, Jun 6, 2025 at 9:02 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
Feedback welcome,
Just a general question about benchmarks, not this one in particular. If you are doing something in a loop (e.g. this one <https://github.com/joaquintides/bloom/blob/1f0f953196c8e16e1c5fd8f62e18f4883dbcd44e/benchmark/comparison_table.cpp#L161>), did you check that the compiler did not unroll it in one case and not in the other? You might say this is "natural" in the sense that you did not tell the compiler to unroll or not to unroll the loop, but I would partially disagree, since in real code it is unlikely that people will always iterate over an array of contiguous data; in real code they might get some data, parse it, then do one lookup, get more data, parse it, do one lookup... I am not saying this is likely to happen, but it may be worth checking, because a long time ago I was playing around with benchmarking some hash map code and "weird" results that did not make sense were caused by the compiler unrolling the loop in one case and not in the other. After I used #pragma <https://releases.llvm.org/4.0.0/tools/clang/docs/AttributeReference.html#pragma-unroll-pragma-nounroll> to disable loop unrolling, the results made sense.
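For illustration, this is roughly how the Clang pragma linked above would be applied to a benchmark loop (a sketch only; Filter, count_hits and the loop body are stand-ins for the actual harness code, not the real benchmark):

#include <cstddef>
#include <vector>

// Placing the pragma right before the benchmarked loop disables unrolling,
// so both filter configurations are measured under the same codegen.
template<typename Filter>
std::size_t count_hits(const Filter& f, const std::vector<int>& data)
{
  std::size_t res = 0;
#pragma nounroll
  for(std::size_t i = 0; i < data.size(); ++i) {
    res += f.may_contain(data[i]) ? 1 : 0;
  }
  return res;
}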

On 06/06/2025 at 22:06, Ivan Matek wrote:
On Fri, Jun 6, 2025 at 9:02 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
Feedback welcome,
Just a general question about benchmarks, not this one in particular. If you are doing something in a loop (e.g. this one <https://github.com/joaquintides/bloom/blob/1f0f953196c8e16e1c5fd8f62e18f4883dbcd44e/benchmark/comparison_table.cpp#L161>), did you check that the compiler did not unroll it in one case and not in the other? You might say this is "natural" in the sense that you did not tell the compiler to unroll or not to unroll the loop, but I would partially disagree, since in real code it is unlikely that people will always iterate over an array of contiguous data; in real code they might get some data, parse it, then do one lookup, get more data, parse it, do one lookup... I am not saying this is likely to happen, but it may be worth checking, because a long time ago I was playing around with benchmarking some hash map code and "weird" results that did not make sense were caused by the compiler unrolling the loop in one case and not in the other. After I used #pragma <https://releases.llvm.org/4.0.0/tools/clang/docs/AttributeReference.html#pragma-unroll-pragma-nounroll> to disable loop unrolling, the results made sense.
I haven't examined the codegen so I can't tell for sure. My intuition is that loop unrolling won't take place because the body of the loop is not trivial and unrolling would add a lot of pressure to the instruction cache and take up too many registers for effective pipelining. Anyway, why don't you run it locally and play with the #pragmas? Besides, I'm interested in results outside my local machine and GHA. You just have to compile this in release mode (note the repo branch): https://github.com/joaquintides/bloom/blob/feature/alternative-hash-producti... and execute with ./a.out 1000000 (for 1M elements, you can try 10M elements or any other value as well). Thank you! Joaquin M Lopez Munoz

On Sat, Jun 7, 2025 at 8:05 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
Anyway, why don't you run it locally and play with the #pragmas?
Because when I quickly go to benchmark something 9 hours later I am just quickly benchmarking something :) Also, ensuring reproducibility is a pain, e.g. I do not have an unused machine that I can SSH into to avoid my browser or random background processes messing with the benchmark, especially considering bloom uses the L3 cache a lot.
Besides, I'm interested in results outside my local machine and GHA. You just have to compile this in release mode (note the repo branch):
https://github.com/joaquintides/bloom/blob/feature/alternative-hash-producti...
Well, it was more complicated, since I already have modular Boost on my machine, so I had to do some hacks to get CMakeLists.txt to work, and also the benchmark did not have a CMakeLists.txt, and I also used -march=native, -mtune=native instead of what your scripts do... But to quickly recap:

1. There seems to be no unrolling happening without me forcing it with pragmas.
2. I have increased constants to reduce the chance of noise affecting results:
- static const int num_trials=10;
- static const milliseconds min_time_per_trial(10);
+ static const int num_trials=20;
+ static const milliseconds min_time_per_trial(50);
3. I did this to make the tables more aligned:
- "<table>\n"
+ "<table style=\"font-family: monospace\">\n"
4. In terms of benchmark setup I would add 5% of "opposite" lookups (e.g. successes among the failures), since I presume the current setup does not penalize branchy code as realistic scenarios would (although it is possible real code might also have close to 100% successes or failures). Just to be clear: I did not make this change (see the sketch after this message).
5. I would suggest considering switching the benchmark repo to use -march=native instead of -mavx2.

My tests were of the form: taskset --cpu-list 0 {binary} {number} >> {description}.html
The CPU was an i7-13700H, core speed was not locked, ranging between 3.2 and 3.8 GHz; it is possible the AVX code was affecting CPU speed, but I did not check, it could just be accumulated heat.
flags: FLAGS = -O3 -DNDEBUG -fcolor-diagnostics -march=native -mtune=native

I have attached 2 runs so you can see the noise of measurement on my machine. I have also attached one unrolled run, just to see that it can cause a difference, but as I said this does not matter much since by default clang does not unroll.
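A minimal sketch of the "5% opposite lookups" idea from item 4 (a hypothetical helper, not part of the actual benchmark; names and the 5% default are illustrative only):

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Build a query set that is mostly "expected hits" (inserted elements) with a
// small fraction of "opposite" lookups (elements that were never inserted),
// so that branchy lookup code occasionally mispredicts, as it would in a less
// clear-cut real workload. Assumes not_inserted is non-empty.
std::vector<int> make_mixed_queries(const std::vector<int>& inserted,
                                    const std::vector<int>& not_inserted,
                                    double opposite_fraction = 0.05,
                                    unsigned seed = 12345)
{
  std::vector<int> queries;
  queries.reserve(inserted.size());
  std::mt19937 gen(seed);
  std::bernoulli_distribution flip(opposite_fraction);
  std::uniform_int_distribution<std::size_t> pick(0, not_inserted.size() - 1);
  for(std::size_t i = 0; i < inserted.size(); ++i) {
    queries.push_back(flip(gen) ? not_inserted[pick(gen)] : inserted[i]);
  }
  std::shuffle(queries.begin(), queries.end(), gen);
  return queries;
}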

On 08/06/2025 at 17:13, Ivan Matek wrote:
On Sat, Jun 7, 2025 at 8:05 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
Anyway, why don't you run it locally and play with the #pragmas?
Because when I quickly go to benchmark something 9 hours later I am just quickly benchmarking something :) Also, ensuring reproducibility is a pain, e.g. I do not have an unused machine that I can SSH into to avoid my browser or random background processes messing with the benchmark, especially considering bloom uses the L3 cache a lot.
Hey, thanks so much for running the benchmarks! Yes, variance hurts analysis. I'm planning to move my GHA-based benchmarks to dedicated machines so that results are more stable.
Besides, I'm interested in results outside my local machine and GHA. You just have to compile this in release mode (note the repo branch):
https://github.com/joaquintides/bloom/blob/feature/alternative-hash-producti...
Well, it was more complicated, since I already have modular Boost on my machine, so I had to do some hacks to get CMakeLists.txt to work, and also the benchmark did not have a CMakeLists.txt, and I also used -march=native, -mtune=native instead of what your scripts do...
But to quickly recap:
1. There seems to be no unrolling happening without me forcing it with pragmas.
2. I have increased constants to reduce the chance of noise affecting results:
- static const int num_trials=10;
- static const milliseconds min_time_per_trial(10);
+ static const int num_trials=20;
+ static const milliseconds min_time_per_trial(50);
3. I did this to make the tables more aligned:
- "<table>\n"
+ "<table style=\"font-family: monospace\">\n"
4. In terms of benchmark setup I would add 5% of "opposite" lookups (e.g. successes among the failures), since I presume the current setup does not penalize branchy code as realistic scenarios would (although it is possible real code might also have close to 100% successes or failures). Just to be clear: I did not make this change.
5. I would suggest considering switching the benchmark repo to use -march=native instead of -mavx2.
So, unrolling does not happen, this is out of the way, thanks for investigating. I'll use -march=native as you suggest. As for the difference between the original hash production scheme and the one proposed by Kostas (cells marked with *), the numbers are not very conclusive, but it looks like Kostas's approach incurs a slight degradation in execution time. I hope we can see this more clearly with the upcoming GHA benchmarks on dedicated machines. Joaquin M Lopez Munoz

On Mon, Jun 9, 2025 at 10:31 AM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
I'll use -march=native as you suggest. As for the difference between the original hash production scheme and the one proposed by Kostas (cells marked with *), the numbers are not very conclusive, but it looks like Kostas's approach incurs a slight degradation in execution time. I hope we can see this more clearly with the upcoming GHA benchmarks on dedicated machines.
Btw, one more thing that I noticed but did not investigate in detail: GCC 14 numbers are much, much worse than Clang 20 on my machine, but just for some values in the tables, while others are super close. I would not bother you about a tiny performance difference, but this is huge. At first I thought it was just me doing something wrong, i.e. the number of elements, but the other numbers match closely. I do not know a nice way to show you the ratio of numbers from 2 HTML documents, but if you search for 13.71 and 13.64 in result_1M_num_release_withoutpragma_gcc_run0.html you will see the slow values. This is not a one-off discrepancy, i.e. on multiple runs I get a large difference, although it does fluctuate a bit.

On 09/06/2025 at 12:39, Ivan Matek wrote:
On Mon, Jun 9, 2025 at 10:31 AM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
I'll use -march=native as you suggest. As for the difference between the original hash production scheme and the one proposed by Kostas (cells marked with *), the numbers are not very conclusive, but it looks like Kostas's approach incurs a slight degradation in execution time. I hope we can see this more clearly with the upcoming GHA benchmarks on dedicated machines.
Btw, one more thing that I noticed but did not investigate in detail: GCC 14 numbers are much, much worse than Clang 20 on my machine, but just for some values in the tables, while others are super close. I would not bother you about a tiny performance difference, but this is huge.
At first I thought it was just me doing something wrong, i.e. the number of elements, but the other numbers match closely. I do not know a nice way to show you the ratio of numbers from 2 HTML documents, but if you search for 13.71 and 13.64 in result_1M_num_release_withoutpragma_gcc_run0.html you will see the slow values.
This is not a one-off discrepancy, i.e. on multiple runs I get a large difference, although it does fluctuate a bit.
I see... well, who knows. FWIW I ran the test locally with GCC 13.2 on MSYS2 and these anomalies don't show up. It may be a codegen issue; if you're keen on investigating this further, try reducing the number of filter configurations being tested (for instance, remove all rows except #3 and #4). Joaquin M Lopez Munoz

One more thing I noticed/checked and I believe it may be of interest, again not in particular for this proposed Knuth multiplier change, but in general. Some filters have much weaker performance for mixed lookup. To be honest I expected that for some (branch misprediction), but I do not know enough about all the filters to say whether the slowdowns I have observed are as expected. Interestingly, some fast_ filters have bad performance for mixed lookup, although naively I would assume SIMD code is branchless. I presume the issue is that sometimes we operate on more than 256 bits so there is branching...

What I find interesting besides this slowdown is that unordered_flat_set also has bad performance for mixed lookup, so bloom filters that do not degrade on mixed lookup look much better than if we just compare the 100%/0% cases. For example, if we round performance to 1 decimal digit and compare unordered_flat_set 4.7/3.1/*9.4* vs filter<int, 1ul, block<unsigned long, 7ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange> 3.3/3.3/*3.3*, the comparison looks much, much different if we drop mixed performance (bolded part).

Here is the output of my small test run for 1M items. I have also attached the source if somebody wants to run it on their machine; it is modified benchmark code. Note that I did not add the mixed numbers to the HTML output, they are printed on cerr for the cases where a slowdown is detected. Usual disclaimer: my machine, only 1 compiler, etc.

Slow mixed lookup unordered_flat_set_filter<int>
  FPR 0 insertion_time 17.5341 successful_lookup_time 4.68546 unsuccessful_lookup_time 3.10828 mixed_lookup_time 9.3794
Slow mixed lookup filter<int, 1ul, fast_multiblock32<11ul>, 0ul, hash<int>, allocator<int>, mcg_and_fastrange>
  FPR 0.1169 insertion_time 3.13323 successful_lookup_time 3.11569 unsuccessful_lookup_time 2.1553 mixed_lookup_time 9.8229
Slow mixed lookup filter<int, 1ul, fast_multiblock32<13ul>, 0ul, hash<int>, allocator<int>, mcg_and_fastrange>
  FPR 0.028 insertion_time 3.35237 successful_lookup_time 3.57547 unsuccessful_lookup_time 2.49646 mixed_lookup_time 10.173
Slow mixed lookup filter<int, 1ul, fast_multiblock32<11ul>, 0ul, hash<int>, allocator<int>, fastrange_and_fixed_mcg>
  FPR 0.1201 insertion_time 3.19758 successful_lookup_time 3.23826 unsuccessful_lookup_time 2.24502 mixed_lookup_time 9.93281
Slow mixed lookup filter<int, 1ul, fast_multiblock32<13ul>, 0ul, hash<int>, allocator<int>, fastrange_and_fixed_mcg>
  FPR 0.0319 insertion_time 3.32857 successful_lookup_time 3.33717 unsuccessful_lookup_time 2.60953 mixed_lookup_time 10.5385
Slow mixed lookup filter<int, 1ul, fast_multiblock32<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  FPR 0.084 insertion_time 3.33082 successful_lookup_time 3.55582 unsuccessful_lookup_time 2.28511 mixed_lookup_time 9.61687
Slow mixed lookup filter<int, 1ul, fast_multiblock64<11ul>, 0ul, hash<int>, allocator<int>, mcg_and_fastrange>
  FPR 0.0761 insertion_time 4.61749 successful_lookup_time 4.8703 unsuccessful_lookup_time 3.57926 mixed_lookup_time 11.4899
Slow mixed lookup filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  FPR 0.0629 insertion_time 4.47389 successful_lookup_time 4.73559 unsuccessful_lookup_time 3.62872 mixed_lookup_time 11.0488
Slow mixed lookup filter<int, 1ul, fast_multiblock32<13ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  FPR 0.0183 insertion_time 3.49621 successful_lookup_time 3.31442 unsuccessful_lookup_time 2.77045 mixed_lookup_time 9.50943
Slow mixed lookup filter<int, 1ul, fast_multiblock64<13ul>, 0ul, hash<int>, allocator<int>, mcg_and_fastrange>
  FPR 0.0154 insertion_time 5.32834 successful_lookup_time 6.67898 unsuccessful_lookup_time 4.46843 mixed_lookup_time 12.1819
Slow mixed lookup filter<int, 1ul, fast_multiblock64<14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  FPR 0.0121 insertion_time 6.06002 successful_lookup_time 6.80858 unsuccessful_lookup_time 4.30832 mixed_lookup_time 12.2018
Slow mixed lookup filter<int, 1ul, fast_multiblock32<11ul>, 1ul, hash<int>, allocator<int>, fastrange_and_fixed_mcg>
  FPR 0.0914 insertion_time 3.14643 successful_lookup_time 3.08782 unsuccessful_lookup_time 2.14325 mixed_lookup_time 9.40797
Slow mixed lookup filter<int, 1ul, fast_multiblock64<11ul>, 0ul, hash<int>, allocator<int>, fastrange_and_fixed_mcg>
  FPR 0.0757 insertion_time 4.24599 successful_lookup_time 4.46389 unsuccessful_lookup_time 3.33413 mixed_lookup_time 11.2218
Slow mixed lookup filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, fastrange_and_fixed_mcg>
  FPR 0.0638 insertion_time 4.29631 successful_lookup_time 4.46474 unsuccessful_lookup_time 3.33839 mixed_lookup_time 11.0688
Slow mixed lookup filter<int, 1ul, fast_multiblock32<13ul>, 1ul, hash<int>, allocator<int>, fastrange_and_fixed_mcg>
  FPR 0.0191 insertion_time 3.28589 successful_lookup_time 3.28611 unsuccessful_lookup_time 2.50493 mixed_lookup_time 9.61269
Slow mixed lookup filter<int, 1ul, fast_multiblock64<13ul>, 0ul, hash<int>, allocator<int>, fastrange_and_fixed_mcg>
  FPR 0.0145 insertion_time 5.298 successful_lookup_time 5.89132 unsuccessful_lookup_time 5.36365 mixed_lookup_time 15.9068
Slow mixed lookup filter<int, 1ul, fast_multiblock64<14ul>, 1ul, hash<int>, allocator<int>, fastrange_and_fixed_mcg>
  FPR 0.0125 insertion_time 6.79997 successful_lookup_time 6.54907 unsuccessful_lookup_time 5.04036 mixed_lookup_time 13.1906

On 10/06/2025 at 14:52, Ivan Matek wrote:
One more thing I noticed/checked and I believe it may be of interest, again not in particular for this proposed Knuth multiplier change, but in general.
Some filters have much weaker performance for mixed lookup. To be honest I expected that for some (branch misprediction), but I do not know enough about all the filters to say whether the slowdowns I have observed are as expected. Interestingly, some fast_ filters have bad performance for mixed lookup, although naively I would assume SIMD code is branchless. I presume the issue is that sometimes we operate on more than 256 bits so there is branching...
fast_multiblock* has branches (early termination on lookup) at k = 8, 16, ..., that is, whenever a full subblock (a pair of __m128is, a __m256i, a pair of __m256is or a uint32x4x2_t, depending on the subfilter and the SIMD technology) has been processed.
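A rough sketch of the shape of such an early-terminating subblock loop (a simplified illustration only, not the library's actual fast_multiblock code; make_mask and next_hash are stand-ins for the real helpers, and the bit derivation is illustrative):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Stand-in for the hash chaining step (the thread discusses detail::mulx64;
// here the multiplicative step is used for illustration).
inline std::uint64_t next_hash(std::uint64_t h) { return h * 6364136223846793005ull; }

// Build a mask with one bit set in each of the 4 64-bit lanes, taking four
// 6-bit indices from the hash.
inline __m256i make_mask(std::uint64_t h)
{
  alignas(32) std::uint64_t lanes[4];
  for(int i = 0; i < 4; ++i) {
    lanes[i] = std::uint64_t(1) << (h & 63);
    h >>= 6;
  }
  return _mm256_load_si256(reinterpret_cast<const __m256i*>(lanes));
}

// Branchful check over N 256-bit subblocks: _mm256_testc_si256(value, mask)
// returns 1 iff every bit of mask is also set in value, so we can bail out
// on the first subblock that fails (the early termination described above).
template<std::size_t N>
bool check_branchful(const __m256i (&x)[N], std::uint64_t hash)
{
  for(std::size_t i = 0; i < N; ++i) {
    if(!_mm256_testc_si256(x[i], make_mask(hash))) return false;
    hash = next_hash(hash);
  }
  return true;
}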
What I find interesting besides this slowdown is that unordered_flat_set also has bad performance for mixed lookup, so bloom filters that do not degrade on mixed lookup look much better than if we just compare the 100%/0% cases. For example, if we round performance to 1 decimal digit and compare unordered_flat_set 4.7/3.1/*9.4* vs filter<int, 1ul, block<unsigned long, 7ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange> 3.3/3.3/*3.3*
the comparison looks much, much different if we drop mixed performance (bolded part).
Unlike fast_multiblock*, [multi]block is branchless, so your results make sense. I wonder if multiblock<K> should be made branchful as the working memory area can get quite large for large values of K. Joaquin M Lopez Munoz

On Wed, Jun 11, 2025 at 6:33 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
Unlike fast_multiblock*, [multi]block is branchless, so your results make sense. I wonder if multiblock<K> should be made branchful as the working memory area can get quite large for large values of K.
Can you elaborate on why you would want to change multiblock?
To save memory accesses that are unnecessary and slow down the rest of the program because they occupy cache, or purely to reduce the number of instructions executed? If that is the case, on my CPU it does not seem to matter a lot, i.e. changing the AVX2 code to be branchless does not help or hurt performance; maybe the CPU decides to fetch the data anyway...

This is what I did to make the AVX code branchless, it did not affect performance in a way I could detect:

static BOOST_FORCEINLINE bool check(const value_type& x,boost::uint64_t hash)
{
  bool found = true;
  for(int i=0;i<k/8;++i){
    found = found && check_m256ix2(x[i],hash,8);
    hash=detail::mulx64(hash);
  }
  if constexpr (k%8){
    found = found && check_m256ix2(x[k/8],hash,k%8);
  }
  return found;
}

But back to multiblock and making it branchful: with my CPU/compiler (clang is 🚀 for this benchmark compared to gcc) it outperforms fast_multiblock for L2, L3 and RAM size filters when it comes to mixed lookups, so I am not certain it needs changing. Now, for unsuccessful lookups fast_multiblock wins a little, but I am not sure that for all use cases unsuccessful lookups will be close to 100%. Maybe the best option is to allow users to configure whether the filter is branchy or not, but with so many filters I am not sure another config option is best, as it may confuse users.

In any case here are the numbers for 2 similar filters on my machine: 1M, 3M, 35M (I have a lot of L3 cache so I needed a large number of values to go over L3).

*CLANG*

*1M/L2*
filter filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 1.90735 FPR 0.0581 insertion_time 4.30731 successful_lookup_time 4.62453 unsuccessful_lookup_time 4.64252 mixed_lookup_time 4.6812
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 1.90735 FPR 0.0629 insertion_time 4.33433 successful_lookup_time 4.47549 unsuccessful_lookup_time 3.46791 mixed_lookup_time 11.0634

*3M/L3*
filter filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.72205 FPR 0.0609667 insertion_time 7.01126 successful_lookup_time 6.40535 unsuccessful_lookup_time 6.30024 mixed_lookup_time 6.20703
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.72205 FPR 0.0632 insertion_time 7.17907 successful_lookup_time 6.13649 unsuccessful_lookup_time 6.17347 mixed_lookup_time 12.0361

*35M/RAM*
filter filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 66.7572 FPR 0.0635571 insertion_time 19.3996 successful_lookup_time 18.6527 unsuccessful_lookup_time 18.1346 mixed_lookup_time 18.2745
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 66.7572 FPR 0.06368 insertion_time 18.9966 successful_lookup_time 18.729 unsuccessful_lookup_time 16.1566 mixed_lookup_time 21.5246

In case somebody is curious, the GCC 14 numbers:

*GCC*

*1M*
filter filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 1.90735 FPR 0.0581 insertion_time 13.0513 successful_lookup_time 8.06424 unsuccessful_lookup_time 8.09518 mixed_lookup_time 8.21242
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 1.90735 FPR 0.0629 insertion_time 5.68284 successful_lookup_time 5.7248 unsuccessful_lookup_time 4.32124 mixed_lookup_time 11.7601

*3M*
filter filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.72205 FPR 0.0609667 insertion_time 14.0387 successful_lookup_time 9.44954 unsuccessful_lookup_time 9.3228 mixed_lookup_time 9.41488
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.72205 FPR 0.0632 insertion_time 6.97644 successful_lookup_time 6.69201 unsuccessful_lookup_time 5.9029 mixed_lookup_time 12.6044

*35M*
filter filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 66.7572 FPR 0.0635571 insertion_time 26.4349 successful_lookup_time 32.8604 unsuccessful_lookup_time 33.0485 mixed_lookup_time 33.0334
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 66.7572 FPR 0.06368 insertion_time 20.6438 successful_lookup_time 18.1056 unsuccessful_lookup_time 16.2122 mixed_lookup_time 23.2072 // in some cases over 50% slower than clang

What I find quite interesting is that if you only benchmark with clang or only with gcc you will get totally different conclusions about which filter is faster, not to mention that before measuring I would have assumed that for the 35M case memory latency dominates, so differences in optimization would not matter... For example:
35M clang: mixed_lookup_time 18.2745 vs mixed_lookup_time 21.5246
gcc: mixed_lookup_time 33.0334 vs mixed_lookup_time 23.2072

P.S. I am obviously comparing clang and gcc for just 1 benchmark, please do not extrapolate this to general optimizer performance.

On 12/06/2025 at 13:15, Ivan Matek wrote:
On Wed, Jun 11, 2025 at 6:33 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
Unlike fast_multiblock*, [multi]block is branchless, so your results make sense. I wonder if multiblock<K> should be made branchful as the working memory area can get quite large for large values of K.
Can you elaborate on why you would want to change multiblock? To save memory accesses that are unnecessary and slow down the rest of the program because they occupy cache, or purely to reduce the number of instructions executed?
Well, both. Consider the following cost model:

* Successful lookup executes in A*K cycles for some constant factor A.
* Branchless unsuccessful lookup is also A*K, exactly the same as successful lookup.
* Branchful unsuccessful lookup when the array is half full executes in B*(1*2^-1 + 2*2^-2 + 3*2^-3 + ... + K*2^-(K-1)) cycles. Since the sum decays exponentially, this is roughly equivalent to some constant C irrespective of K.

Now, the thing is that for a sufficiently large K, A*K (branchless) is going to be larger than C (branchful). The best approach would then be to have a hybrid algorithm that branches every L operations. Determining such a value is an interesting experiment.
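Spelling out that sum (a restatement of the arithmetic above under the same half-full assumption, not something computed in the original message): the expected number of subblocks inspected by a branchful unsuccessful lookup is

  1*2^-1 + 2*2^-2 + ... + (K-1)*2^-(K-1) + K*2^-(K-1) = 2 - 2^-(K-1) < 2,

so the branchful cost is bounded by roughly C ≈ 2*B independently of K, while the branchless cost A*K grows linearly; the crossover where branchless starts losing is therefore around K > C/A ≈ 2*B/A.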
If that is the case, on my CPU it does not seem to matter a lot, i.e. changing the AVX2 code to be branchless does not help or hurt performance; maybe the CPU decides to fetch the data anyway...
This is what I did to make the AVX code branchless, it did not affect performance in a way I could detect: [...] found = found && check_m256ix2(x[i],hash,8);
This is short circuiting (at least in theory). You may try using bitwise AND (&). Also, try with large values of K: you should reach a point where branchless is worse.
But back to multiblock and making it branchful: with my CPU/compiler (clang is 🚀 for this benchmark compared to gcc) it outperforms fast_multiblock for L2, L3 and RAM size filters when it comes to mixed lookups, so I am not certain it needs changing.
Yes, it's hard to decide based on these numbers. For branchless lookup, mixed lookup obviously takes the same as successful or unsuccessful lookup. For branchful lookup, a cost model for mixed lookup would be P*M*(A*K) + (1-P)*M*C where P is the percentage of successful lookups and M is an extra cost factor on top of the non-mixed approach due to branch mispredictions. Again, an interesting area to investigate. You're giving me a lot of work :-) I'd rather release the lib now in time for Boost 1.89 and then postpone this analysis until 1.90, as it's going to take some work. Joaquin M Lopez Munoz
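Rearranging that mixed-lookup model (my own algebra on the formula above, not a claim made in the thread): a branchless mixed lookup still costs A*K regardless of P, so the branchful variant only pays off on a mixed workload when

  M*(P*(A*K) + (1-P)*C) < A*K,

i.e. when the misprediction factor M stays below (A*K)/(P*A*K + (1-P)*C). The 2x-3x mixed-lookup slowdowns reported earlier for the branchful SIMD filters suggest M is well above 1 in practice.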

On Thu, Jun 12, 2025 at 8:56 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
This is short circuiting (at least in theory). You may try using bitwise AND (&). Also,
Doh. :) Thank you for catching that. On my machine, when I do that the difference is minuscule/noise. I have now bumped K up to 14, so the numbers are not directly comparable to the previous ones, but the most important thing is that there is no difference (for Clang).

Clang

*1M*
filter filter<int, 1ul, multiblock<unsigned long, 14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 1.90735 FPR 0.0737 insertion_time 4.97109 successful_lookup_time 5.75406 unsuccessful_lookup_time 5.73877 mixed_lookup_time 5.77369
filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 1.90735 FPR 0.0736 insertion_time 5.19262 successful_lookup_time 5.8273 unsuccessful_lookup_time 5.81782 mixed_lookup_time 5.81926

*35M*
filter filter<int, 1ul, multiblock<unsigned long, 14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 66.7572 FPR 0.0752343 insertion_time 19.7033 successful_lookup_time 19.8497 unsuccessful_lookup_time 20.9164 mixed_lookup_time 20.6151
filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 66.7572 FPR 0.0748914 insertion_time 19.734 successful_lookup_time 20.3514 unsuccessful_lookup_time 19.5756 mixed_lookup_time 19.375

As before, GCC performance for multiblock is bad, so now the SIMD code easily wins.

GCC

*1M*
filter filter<int, 1ul, multiblock<unsigned long, 14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 1.90735 FPR 0.0737 insertion_time 11.846 successful_lookup_time 9.76772 unsuccessful_lookup_time 9.80503 mixed_lookup_time 9.81788
filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 1.90735 FPR 0.0736 insertion_time 6.17691 successful_lookup_time 6.43346 unsuccessful_lookup_time 6.45574 mixed_lookup_time 6.4289

*35M*
filter filter<int, 1ul, multiblock<unsigned long, 14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 66.7572 FPR 0.0752343 insertion_time 30.3168 successful_lookup_time 36.3754 unsuccessful_lookup_time 37.8927 mixed_lookup_time 38.045
filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 66.7572 FPR 0.0748914 insertion_time 21.2682 successful_lookup_time 21.2692 unsuccessful_lookup_time 21.0058 mixed_lookup_time 19.7536

As before, unsuccessful lookups get slightly worse compared to the branchful version, as expected, since much more extra work is done. SIMD code branchful vs branchless for 3M (Clang):
  (branchful) successful_lookup_time 6.5231 unsuccessful_lookup_time 5.71519 mixed_lookup_time 12.8607
  (branchless) insertion_time 6.44847 successful_lookup_time 6.68598 unsuccessful_lookup_time 6.66759 mixed_lookup_time 6.71642
Again, an interesting area to investigate. You're giving me a lot of work :-)
Too hard work for me to help with :), but one speculative idea: it may be reasonable to prepare the lookups (i.e. make_m256ix2(hash,kp); ) for all hashes unconditionally, but still have branchy code in terms of breaking when the first _mm256_testc_si256 detects no match. Of the simpler tasks: if you want, I can try to make a pull request for adding mixed results to the benchmark. The downside is that the benchmark already has a lot of columns and rows, so users might find it confusing. On the other hand I think the mixed benchmark is valuable info for people who will not have 90%+ of hits or misses in their usage.
I'd rather release the lib now in time for Boost 1.89 and then postpone this analysis for 1.90, as it's going to take some work.
I agree, I never meant to imply this is a critical-fix-needed situation, I just found it quite interesting. For the record, this is the check now:

static BOOST_FORCEINLINE bool check(const value_type& x,boost::uint64_t hash)
{
  bool found = true;
  for(int i=0;i<k/8;++i){
    found &= check_m256ix2(x[i],hash,8);
    hash=detail::mulx64(hash);
  }
  if constexpr (k%8){
    found &= check_m256ix2(x[k/8],hash,k%8);
  }
  return found;
}

One more datapoint that confirms what you suggested (and what intuitively makes sense): for large K the effect of breaking early for unsuccessful lookups looks more impressive (when the filter fits in L2). Here are the K == 19 results. I picked K == 19 because it is bigger than 16, no other reason; maybe it is a silly number to use for a real bloom filter. Clang results only.

*1M*
branchless:
filter filter<int, 1ul, fast_multiblock64<19ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 2.38419 FPR 0.0174 insertion_time 6.89076 successful_lookup_time 7.72433 unsuccessful_lookup_time 7.76004 mixed_lookup_time 7.78691
branchful:
filter filter<int, 1ul, fast_multiblock64<19ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 2.38419 FPR 0.0174 insertion_time 6.808 successful_lookup_time 7.09307 unsuccessful_lookup_time 4.9642 mixed_lookup_time 12.8065

For L3 sizes the branchless code performs much better relatively, even though the branchy code still wins for unsuccessful lookups, I would assume because of large memory latencies.

*3M*
branchless:
filter filter<int, 1ul, fast_multiblock64<19ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 7.15256 FPR 0.0153667 insertion_time 10.1734 successful_lookup_time 9.1581 unsuccessful_lookup_time 9.0949 mixed_lookup_time 9.10151
branchful:
filter filter<int, 1ul, fast_multiblock64<19ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 7.15256 FPR 0.0153667 insertion_time 8.9318 successful_lookup_time 8.79925 unsuccessful_lookup_time 8.34029 mixed_lookup_time 14.4601

There is also another interesting fact about performance that will take a bit to explain, so I ask for patience: fast_multiblock<8> is much faster than fast_multiblock<11> for unsuccessful_lookup_time. This does not initially make sense, because we would expect to execute the same amount of instructions, and the branch is easily predictable since 100% of lookups fail (besides the FPR). I am not an asm expert, but it seems clang interleaves the SIMD for the 11 case, so some stuff that is not strictly necessary before the branch is done before the branch. Now, this is great when we do not branch (early return) but is a waste when we have close to 100% unsuccessful lookups. This made me investigate a bit, and on my machine I can get really great performance for unsuccessful_lookup_time when I combine a SIMD and a scalar filter for the "remainder" (e.g. SIMD 8, scalar 3 for K==11). This combined filter (bobw_filter) has really great performance for unsuccessful lookups; it is unclear if this is just a microbenchmark anomaly (if there was some extra code, as in a real application, maybe the extra SIMD code for fast_multiblock64<11ul> being executed would not matter).

*3M*
filter bobw_filter<11ul> // fast_multiblock64<8> + multiblock<3>
  capacity MB 5.72205 FPR 0.0555667 insertion_time 7.7845 successful_lookup_time 6.48931 unsuccessful_lookup_time 3.45824 mixed_lookup_time 12.2271
filter filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.72205 FPR 0.0609667 insertion_time 6.6754 successful_lookup_time 6.161 unsuccessful_lookup_time 6.19519 mixed_lookup_time 6.09886
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange> // branchful
  capacity MB 5.72205 FPR 0.0632 insertion_time 6.33386 successful_lookup_time 5.87341 unsuccessful_lookup_time 5.86074 mixed_lookup_time 12.0457

I would not bother you for a small difference, but 3.46 vs 6.20 and 5.86 looks quite impressive, with the disclaimer I mentioned before that it may be just a tight microbenchmark difference, and this is just for cases when people are purely focused on unsuccessful lookups. The advantage holds even for a large number of elements for unsuccessful lookup, but the other numbers look quite bad relatively.

*35M*
filter bobw_filter<11ul>
  capacity MB 66.7572 FPR 0.0569229 insertion_time 23.9595 successful_lookup_time 22.1188 unsuccessful_lookup_time 10.6815 mixed_lookup_time 24.646
filter filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 66.7572 FPR 0.0635571 insertion_time 19.1072 successful_lookup_time 18.0308 unsuccessful_lookup_time 18.0789 mixed_lookup_time 18.0465
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 66.7572 FPR 0.06368 insertion_time 18.6431 successful_lookup_time 16.2872 unsuccessful_lookup_time 14.8029 mixed_lookup_time 21.1569

I understand that mixing 2 filters instead of modifying one to have a mix of SIMD and scalar code is not really the greatest comparison, but it was much easier for me to make a combined filter than to figure out how to mix scalar and SIMD filter code in one filter.

template<std::size_t K>
struct bobw_filter {
  static constexpr bool has_scalar = K%4;

  bobw_filter() = default;

  static constexpr size_t scalar_capacity(size_t full_capacity) {
    return full_capacity*(K%4)/K;
  }
  static constexpr size_t simd_capacity(const size_t full_capacity) {
    return full_capacity*(K/4*4)/K;
  }

  explicit bobw_filter(size_t capacity) : simd_f(simd_capacity(capacity)) {
    if constexpr(has_scalar) {
      assert(scalar_capacity(capacity)!=0);
      scalar_f.reset(scalar_capacity(capacity));
    }
  }

  // TODO: is this mulx64 correct thing to do, or we should apply it on hash of x?
  BOOST_FORCEINLINE void insert(const int& x) {
    simd_f.insert(x);
    if constexpr(has_scalar) {
      scalar_f.insert(detail::mulx64(x));
    }
  }

  BOOST_FORCEINLINE bool may_contain(const int& x) {
    if constexpr(has_scalar) {
      return simd_f.may_contain(x) && scalar_f.may_contain(detail::mulx64(x));
    } else {
      return simd_f.may_contain(x);
    }
  }

  auto capacity() {
    if constexpr(has_scalar) {
      return scalar_f.capacity() + simd_f.capacity();
    } else {
      return simd_f.capacity();
    }
  }

  [[no_unique_address]] std::conditional_t<has_scalar,
    filter<int, 1, multiblock<boost::uint64_t, K % 4>, 1>,
    std::monostate> scalar_f;
  filter<int, 1, fast_multiblock64<K / 4 * 4>, 1> simd_f;
  using value_type = typename decltype(simd_f)::value_type;
};

On 14/06/2025 at 19:32, Ivan Matek wrote:
One more datapoint that confirms what you suggested (and what intuitively makes sense): for large K the effect of breaking early for unsuccessful lookups looks more impressive (when the filter fits in L2).
[...]
There is also another interesting fact about performance that will take a bit to explain, so I ask for patience: fast_multiblock<8> is much faster than fast_multiblock<11> for unsuccessful_lookup_time. This does not initially make sense, because we would expect to execute the same amount of instructions, and the branch is easily predictable since 100% of lookups fail (besides the FPR). I am not an asm expert, but it seems clang interleaves the SIMD for the 11 case, so some stuff that is not strictly necessary before the branch is done before the branch. Now, this is great when we do not branch (early return) but is a waste when we have close to 100% unsuccessful lookups.
Yes, this is odd. If your thesis is correct, the budget optimizer of the compiler seems to have decided to go branchless despite what the code says.
This made me investigate a bit, and on my machine I can get really great performance for unsuccessful_lookup_time when I combine a SIMD and a scalar filter for the "remainder" (e.g. SIMD 8, scalar 3 for K==11).
This is an idea that deserves some investigation. When K%8 is small, going scalar could actually be faster than SIMD. This is what your experiments seem to suggest anyway.
This combined filter (bobw_filter) has really great performance for unsuccessful lookups[...]
[...]
I understand that mixing 2 filters instead of modifying one to have a mix of SIMD and scalar code is not really the greatest comparison, but it was much easier for me to make a combined filter than to figure out how to mix scalar and SIMD filter code in one filter.
If you want to keep on playing, please find attached a sketch of a bobw_multiblock64_avx2 filter for your enjoyment. (Why the "bobw" prefix, btw?) Joaquin M Lopez Munoz

On Sun, Jun 15, 2025 at 6:45 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
On 14/06/2025 at 19:32, Ivan Matek wrote:
Yes, this is odd. If your thesis is correct, the budget optimizer of the compiler seems to have decided to go branchless despite what the code says.
I was wrong, I just checked with the VTune asm view and the branch is definitely there (if I can read asm correctly); this is for K==11, unmodified (branchful) AVX2 code. I will need to investigate further to respond to the other comments you made, I just wanted to say ASAP that this speculation of mine was wrong. I now believe prefetch is the cause of the difference in performance: as you can see, for K==11 we have 3 prefetches, and that is why we suffer in the unsuccessful lookup case.

Address  Source Line  Assembly  CPU Time  Instructions Retired
0x6740 0 Block 7:
0x6740 0 xor %edx, %edx
0x6742 246 add %rdx, %rax
0x6745 0 add $0x4, %rcx
0x6749 0 cmp %rsi, %rcx
0x674c 246 jz 0x681c <Block 10>
0x6752 0 Block 8:
0x6752 0 movsxdl (%rcx), %rdx 0usec 117,450,000
0x6755 0 mulx %r12, %r9, %rdx 0usec 143,550,000
0x675a 0 xor %r9, %rdx 0usec 136,300,000
0x675d 0 or $0x1, %rdx 0usec 118,900,000
0x6761 0 mulx %rdi, %rdx, %r9 0usec 130,790,000
*0x6766 0 prefetcht0z (%r8,%r9,1) 0usec 111,360,000
0x676b 0 prefetcht0z 0x40(%r8,%r9,1) 0usec 147,320,000
0x6771 0 prefetcht0z 0x80(%r8,%r9,1) 0usec 132,820,000*
0x677a 0 vmovdquy (%r8,%r9,1), %ymm0 198.739usec 109,910,000
0x6780 0 vmovdquy 0x20(%r8,%r9,1), %ymm1 0usec 134,270,000
0x6787 0 vmovq %rdx, %xmm2 99.370usec 128,180,000
0x678c 0 vpbroadcastq %xmm2, %ymm2 0usec 122,670,000
0x6791 0 vpsllvq %ymm4, %ymm2, %ymm2 0usec 133,980,000
0x6796 0 vpsrld $0x1a, %ymm2, %ymm2 99.370usec 100,920,000
0x679b 0 vpmovzxdq %xmm2, %ymm3 0usec 163,850,000
0x67a0 0 vpsllvq %ymm3, %ymm5, %ymm3 0usec 117,450,000
0x67a5 0 vextracti128 $0x1, %ymm2, %xmm2 0usec 130,790,000
0x67ab 0 vpmovzxdq %xmm2, %ymm2 0usec 139,200,000
0x67b0 0 vptest %ymm3, %ymm0 0usec 97,440,000
0x67b5 0 setb %r10b 0usec 175,450,000
0x67b9 0 vpsllvq %ymm2, %ymm5, %ymm0 0usec 126,150,000
0x67be 0 vptest %ymm0, %ymm1 0usec 120,060,000
0x67c3 0 setb %r11b 0usec 122,960,000
0x67c7 0 test %r10b, %r11b 99.370usec 258,390,000
*0x67ca 0 jz 0x6740 <Block 7> 99.370usec 0*
0x67d0 0 Block 9:
0x67d0 0 add %r8, %r9 0usec 118,610,000
0x67d3 0 add $0x40, %r9 0usec 124,120,000
0x67d7 0 vmovdquy (%r9), %ymm0 0usec 131,660,000
0x67dc 0 mulx %r12, %rdx, %r9 0usec 131,080,000
0x67e1 0 xor %rdx, %r9 0usec 112,230,000
0x67e4 0 vmovq %r9, %xmm1 0usec 140,650,000
0x67e9 0 vpbroadcastq %xmm1, %xmm1 0usec 124,410,000
0x67ee 0 vpsllvq %xmm6, %xmm1, %xmm1 0usec 136,590,000
0x67f3 0 vpsrld $0x1a, %xmm1, %xmm1 0usec 109,910,000
0x67f8 0 vpmovzxdq %xmm1, %ymm1 0usec 134,850,000
0x67fd 0 vpsllvq %ymm1, %ymm7, %ymm1 0usec 118,030,000
0x6802 0 xor %edx, %edx 0usec 103,530,000
0x6804 0 vptest %ymm1, %ymm0 0usec 168,780,000
0x6809 0 setb %dl 0usec 96,860,000
0x680c 246 add %rdx, %rax 0usec 143,550,000
0x680f 0 add $0x4, %rcx 0usec 140,940,000
0x6813 0 cmp %rsi, %rcx
0x6816 246 jnz 0x6752 <Block 8> 0usec 285,360,000

If we prefetch 50% more data (K==8 prefetches only 2 cache lines, K==11 prefetches 3 cache lines) and we never use it (when we are benchmarking unsuccessful lookup), performance suffers.
So if I just change one line (commenting out the prefetch), unsuccessful lookup performance changes drastically (presumably because 3M is in the area where we are over the L2 cache size of my CPU).

*3M*

with prefetch:
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.00679 FPR 0.159833 insertion_time 6.07411 successful_lookup_time 5.62967 *unsuccessful_lookup_time 5.64557* mixed_lookup_time 11.9758

without prefetch:

const unsigned char* next_element(boost::uint64_t& h)const noexcept
{
  auto p=ar.buckets+hs.next_position(h)*bucket_size;
  for(std::size_t i=0;i<prefetched_cachelines;++i){
    //BOOST_BLOOM_PREFETCH((unsigned char*)p+i*cacheline);
  }
  return p;
}

filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.00679 FPR 0.159833 insertion_time 6.02206 successful_lookup_time 5.03205 *unsuccessful_lookup_time 3.58734* mixed_lookup_time 11.3664

For 1M there is no noticeable difference between prefetch and no prefetch (as all data fits in L2). To confirm this is all about memory I tried 35M, and here again commenting out the prefetch helps with unsuccessful lookups.

*35M*

with prefetch:
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 58.4126 FPR 0.160926 insertion_time 17.7846 successful_lookup_time 15.6446 *unsuccessful_lookup_time 14.7353* mixed_lookup_time 21.5229

without prefetch:
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 58.4126 FPR 0.160926 insertion_time 18.0509 successful_lookup_time 13.8877 *unsuccessful_lookup_time 11.4105* mixed_lookup_time 19.807

In the end this is kind of obvious, I was so focused on codegen that I forgot obvious things about caching. Still, I do not think all these measurements were useless: I can imagine a world in which you expect most of the lookups to be unsuccessful and then you disable prefetch, except maybe for the initial block (that is always needed).
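A minimal sketch of that last idea, assuming the same next_element()/BOOST_BLOOM_PREFETCH internals shown in the snippet above (a hypothetical tuning variant, not something the library currently offers):

// Hypothetical variant of next_element(): always prefetch the first cache
// line of the bucket, which every lookup touches anyway, and skip the rest,
// which only pays off when the lookup does not terminate early.
const unsigned char* next_element(boost::uint64_t& h)const noexcept
{
  auto p=ar.buckets+hs.next_position(h)*bucket_size;
  BOOST_BLOOM_PREFETCH((unsigned char*)p); // first cache line only
  // remaining cache lines (i=1..prefetched_cachelines-1) intentionally not prefetched
  return p;
}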

On 12 Jun 2025, at 21:56, Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
For branchful lookup, a cost model for mixed lookup would be
P*M*(A*K) + (1-P)*M*C
where P is the percentage of successful lookups and M is an extra cost factor on top of the non-mixed approach due to branch mispredictions.
I agree this effect will affect measurements to the point that you will never be able to be 100% sure. Let's just agree on the obvious consensus: an extra 64-bit multiplication cannot affect the running time of any program on modern CPUs. The multiplier is from THE ART OF COMPUTER PROGRAMMING by D. Knuth, volume II, page 108. Regards, Kostas

On 13/06/2025 at 14:47, Kostas Savvidis via Boost wrote:
On 12 Jun 2025, at 21:56, Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
For branchful lookup, a cost model for mixed lookup would be
P*M*(A*K) + (1-P)*M*C
where P is the percentage of successful lookups and M is an extra cost factor on top of the non-mixed approach due to branch mispredictions.
I agree this effect will affect measurements to the point that you will never be able to be 100% sure. Let's just agree on the obvious consensus: an extra 64-bit multiplication cannot affect the running time of any program on modern CPUs.
I finally implemented the change here: https://github.com/joaquintides/bloom/commit/b28e526b9e2994d5470adb6ae502481... There's a measurable effect, at least on microbenchmarks; compare the results without * (before) and with * (after): https://github.com/boostorg/boost_bloom_benchmarks/blob/alternative-hash-pro... But, well, it is what it is.
The multiplier is from THE ART OF COMPUTER PROGRAMMING by D. Knuth, volume II, page 108.
I've settled for two multipliers (one for 64-bit and another for 32-bit mode) from Steele and Vigna: https://arxiv.org/pdf/2001.05304 Joaquin M Lopez Munoz

Joaquin M López Muñoz wrote:
Hi,
As per feedback given here:
https://lists.boost.org/Archives/boost//2025/05/259548.php
I've experimented with the hash production scheme suggested by Kostas, where h_{i+1} is calculated as h_i * 6364136223846793005 mod 2^64:
Where does this multiplier come from?

On 07/06/2025 at 1:33, Peter Dimov via Boost wrote:
Joaquin M López Muñoz wrote:
Hi,
As per feedback given here:
https://lists.boost.org/Archives/boost//2025/05/259548.php
I've experimented with the hash production scheme suggested by Kostas, where h_{i+1} is calculated as h_i * 6364136223846793005 mod 2^64: Where does this multiplier come from?
This is the constant suggested by Kostas. A search on the Internet suggests this is used in connection with so-called PCG algorithms. Joaquin M Lopez Munoz

On Sat, Jun 7, 2025 at 9:40 AM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
This is the constant suggested by Kostas. A search on the Internet suggests this is used in connection with so-called PCG algorithms.
I did a bit of digging, could not find anything interesting in the sense that it
proves why it is "good". Knuth provides no info, besides crediting the person who gave him the constant. This 2010 paper is a bit better, but still not great: https://icl.utk.edu/users/luszczek/pubs/rarngicl.pdf
*By choosing a = 6364136223846793005 (=3×5×415949×1020018675983), c = 1, and m = 2^64 we may obtain a sequence with a period 2^64. The same RNG is used in the current implementation of the High Performance LINPACK (HPL) benchmark*

Joaquin M López Muñoz wrote:
On 07/06/2025 at 1:33, Peter Dimov via Boost wrote:
Joaquin M López Muñoz wrote:
Hi,
As per feedback given here:
https://lists.boost.org/Archives/boost//2025/05/259548.php
I've experimented with the hash production scheme suggested by Kostas, where h_{i+1} is calculated as h_i * 6364136223846793005 mod 2^64: Where does this multiplier come from?
This is the constant suggested by Kostas. A search on the Internet suggests this is used in connection with so-called PCG algorithms.
Wikipedia lists it as "MMIX by Donald Knuth", 6364136223846793005 * x + 1442695040888963407 mod 2^64. This is an LCG multiplier (used with b != 0), so it's not clear whether it would be "good" for an MCG (b == 0). For multipliers my go-to reference is Vigna and Steele, https://arxiv.org/pdf/2001.05304. For MCG mod 2^64, it gives 0xf1357aea2e62a9c5 as the "good multiplier".
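For concreteness, a small sketch of the two recurrences being contrasted here (my own illustration; the constants are the ones quoted in this thread, and forcing the MCG state odd is an assumption about typical usage, not a statement about Boost.Bloom internals):

#include <cstdint>

// LCG ("MMIX by Donald Knuth"): x_{n+1} = a*x_n + c mod 2^64, with c != 0.
inline std::uint64_t lcg_step(std::uint64_t x)
{
  return x * 6364136223846793005ull + 1442695040888963407ull;
}

// MCG with the Steele-Vigna "good multiplier": x_{n+1} = a*x_n mod 2^64, c == 0.
// A purely multiplicative generator is normally seeded with an odd state,
// otherwise the sequence can degenerate (0 is a fixed point).
inline std::uint64_t mcg_step(std::uint64_t x)
{
  return x * 0xf1357aea2e62a9c5ull;
}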

On 07/06/2025 at 10:32, Peter Dimov via Boost wrote:
Joaquin M López Muñoz wrote:
On 07/06/2025 at 1:33, Peter Dimov via Boost wrote:
Joaquin M López Muñoz wrote:
Hi,
As per feedback given here:
https://lists.boost.org/Archives/boost//2025/05/259548.php
I've experimented with the hash production scheme suggested by Kostas, where h_{i+1} is calculated as h_i * 6364136223846793005 mod 2^64: Where does this multiplier come from? This is the constant suggested by Kostas. A search on the Internet suggests this is used in connection with so-called PCG algorithms. Wikipedia lists it as "MMIX by Donald Knuth", 6364136223846793005 * x + 1442695040888963407 mod 2^64.
This is an LCG multiplier (used with b != 0), so it's not clear whether it would be "good" for an MCG (b == 0).
For multipliers my go-to reference is Vigna and Steele,https://arxiv.org/pdf/2001.05304.
For MCG mod 2^64, it gives 0xf1357aea2e62a9c5 as the "good multiplier".
I can use that, but I'm currently more concerned about the potential impact on performance, which is allegedly independent of the particular constant chosen. Joaquin M Lopez Munoz

On Tue, Jun 17, 2025 at 3:04 PM Ivan Matek <libbooze@gmail.com> wrote:
I may have mixed up something in my computations, please feel free to correct me.
Or I can correct myself :) This was the issue, to quote the documentation:

*When the default value 0 is used, buckets have the same size as the subarrays subfilters operate on (non-overlapping case). Otherwise, bucket size is smaller and subarrays spill over adjacent buckets, which results in an improved (lower) FPR in exchange for a possibly worse performance due to memory unalignment.*

In other words, even though conceptually the block had 11 uint64_t, they are not stored with the alignment of uint64_t. But it may still be worthwhile to investigate how aggressive we want to be with the prefetching. For example, for TestK==3 all these blocks prefetch 2 cachelines (this is an extreme example, K == 3 is a silly value):
filter<int, 1, multiblock<boost::uint64_t, TestK>, 1>,
filter<int, 1, multiblock<boost::uint64_t, TestK>>,
filter<int, 1, fast_multiblock64<TestK>>,
filter<int, 1, fast_multiblock64<TestK>, 1>
For TestK == 9 these prefetch 3 cachelines:
filter<int, 1, multiblock<boost::uint64_t, TestK>, 1>,
filter<int, 1, fast_multiblock64<TestK>>,
filter<int, 1, fast_multiblock64<TestK>, 1>
while the only block that remains on 2 cache lines for K == 9 is:
filter<int, 1, multiblock<boost::uint64_t, TestK>>

So it all makes sense considering how the filters are specified (e.g. filter<fast_multiblock64<9>,1> has 72 bytes of state with a start address aligned to 1, so it can stretch over 3 cachelines), assuming we want to make sure that for every random start position we load all the cachelines we may need, but the question remains if we really want that. So, questions:
1. Should we skip the prefetch of the last cacheline if it is more than 50% (or some other %) likely that we will not need it?
2. If we do not want to make the changes in 1., should we encourage people to use block size 64 for fast_multiblock64?
Some performance numbers below.

The performance difference wrt prefetch can be large for unsuccessful lookup; here we manually shrink the computed number of prefetched cachelines by 1 (in this example we reduced 3 cachelines to 2):

prefetch 2 cachelines
filter filter<int, 1ul, fast_multiblock64<9ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.00679 FPR 0.154767 insertion_time 6.14765 successful_lookup_time 5.01657 *unsuccessful_lookup_time 3.59357* mixed_lookup_time 11.4932
prefetch 3 cachelines (current prefetch setup)
filter filter<int, 1ul, fast_multiblock64<9ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.00679 FPR 0.154767 insertion_time 6.10838 successful_lookup_time 5.68114 *unsuccessful_lookup_time 5.50207* mixed_lookup_time 11.9192
prefetch 2 cachelines
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.00679 FPR 0.159833 insertion_time 6.01277 successful_lookup_time 4.97918 *unsuccessful_lookup_time 3.6566* mixed_lookup_time 11.5088
prefetch 3 cachelines (current prefetch setup)
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.00679 FPR 0.159833 insertion_time 6.90663 successful_lookup_time 6.14258 *unsuccessful_lookup_time 5.52233* mixed_lookup_time 11.9335

A large block size hurts FPR, but makes blocks aligned so we prefetch fewer cache lines. All the data below is with prefetching enabled and uses the default formula, so no manual adjustment (but block size 64 causes alignment so we only fetch 2 cache lines):

filter filter<int, 1ul, fast_multiblock64<11ul>, *0ul*, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.00695 *FPR 0.187567* insertion_time 6.00145 successful_lookup_time 5.66594 unsuccessful_lookup_time 5.54643 mixed_lookup_time 12.294
filter filter<int, 1ul, fast_multiblock64<11ul>, *1ul*, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.00679 *FPR 0.159833* insertion_time 6.20854 successful_lookup_time 5.74137 unsuccessful_lookup_time 5.6112 mixed_lookup_time 12.1253
filter filter<int, 1ul, fast_multiblock64<11ul>, *64ul*, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.00704 *FPR 0.2164* insertion_time 4.13648 successful_lookup_time 4.35427 unsuccessful_lookup_time 3.49425 mixed_lookup_time 11.3594

Huge wins for the best-aligned filter, except in mixed, where the fact that we are branchful hurts us in all cases. The speedup is not limited just to AVX2 code, here we see plain multiblock (I am not sure why the FPR did not suffer, I assume noise or a hashing artifact):

filter filter<int, 1ul, multiblock<unsigned long, 9ul>, *1ul*, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.72205 *FPR 0.068* insertion_time 7.27091 successful_lookup_time 6.08024 unsuccessful_lookup_time 5.82294 mixed_lookup_time 5.80241
filter filter<int, 1ul, multiblock<unsigned long, 9ul>, *8ul*, hash<int>, allocator<int>, mcg_and_fastrange>
  capacity MB 5.72207 *FPR 0.0672* insertion_time 5.09056 successful_lookup_time 4.70166 unsuccessful_lookup_time 4.51264 mixed_lookup_time 4.48538

To be clear, it is not just prefetching that is affected by alignment, i.e. the code has tag dispatch in the get function based on whether the data is aligned or not.

Regards,

Ivan
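To make the cache-line counts discussed above concrete, here is a small back-of-the-envelope helper (my own sketch; the library's actual prefetch computation may differ):

#include <cstddef>

// Worst-case number of 64-byte cache lines touched by a block of
// `state_bytes` bytes whose start address is only guaranteed to be
// `alignment`-byte aligned.
constexpr std::size_t worst_case_cachelines(std::size_t state_bytes,
                                            std::size_t alignment)
{
  constexpr std::size_t line = 64;
  std::size_t worst_offset = (alignment >= line) ? 0 : line - alignment;
  return (worst_offset + state_bytes + line - 1) / line;
}

// fast_multiblock64<9> holds 72 bytes of state:
static_assert(worst_case_cachelines(72, 1) == 3);   // bucket size 1: up to 3 lines
static_assert(worst_case_cachelines(72, 8) == 2);   // non-overlapping 72-byte stride, i.e. effectively 8-byte aligned: 2 lines
static_assert(worst_case_cachelines(72, 64) == 2);  // 64-byte buckets: 2 lines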

On 12 Jun 2025, at 21:56, Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
For branchful lookup, a cost model for mixed lookup would be
P*M*(A*K) + (1-P)*M*C
where P is the percentage of successful lookups and M is an extra cost factor on top of the non-mixed approach due to branch mispredictions.
I agree this effect will affect measurements to the point that it will be decisive. Let's just agree on the obvious: an extra 64-bit multiplication cannot affect the running time of any program on modern CPUs. The multiplier is from THE ART OF COMPUTER PROGRAMMING by D. Knuth, volume II, page 108. Regards, Kostas
participants (4)
- Ivan Matek
- Joaquin M López Muñoz
- Kostas Savvidis
- Peter Dimov