
On Sun, Jun 15, 2025 at 6:45 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
> On 14/06/2025 at 19:32, Ivan Matek wrote:
>
> Yes, this is odd. If your thesis is correct, the budget optimizer of the
> compiler seems to have decided to go branchless despite what the code says.
I was wrong: I just checked with the VTune asm view and the branch is definitely there (the jz back to Block 7 at 0x67ca, if I can read the asm correctly). This is for K==11, unmodified (branchful) AVX2 code. I will need to investigate further to respond to the other comments you made; I just wanted to say ASAP that this speculation of mine was wrong.

I now believe prefetch is the cause of the difference in performance: as you can see, for K==11 we have 3 prefetches (the prefetcht0 instructions below), and that is why we suffer in the unsuccessful lookup case.

Address  Source Line  Assembly                          CPU Time     Instructions Retired
0x6740   0            Block 7:
0x6740   0            xor %edx, %edx
0x6742   246          add %rdx, %rax
0x6745   0            add $0x4, %rcx
0x6749   0            cmp %rsi, %rcx
0x674c   246          jz 0x681c <Block 10>
0x6752   0            Block 8:
0x6752   0            movsxdl (%rcx), %rdx              0usec        117,450,000
0x6755   0            mulx %r12, %r9, %rdx              0usec        143,550,000
0x675a   0            xor %r9, %rdx                     0usec        136,300,000
0x675d   0            or $0x1, %rdx                     0usec        118,900,000
0x6761   0            mulx %rdi, %rdx, %r9              0usec        130,790,000
0x6766   0            prefetcht0z (%r8,%r9,1)           0usec        111,360,000
0x676b   0            prefetcht0z 0x40(%r8,%r9,1)       0usec        147,320,000
0x6771   0            prefetcht0z 0x80(%r8,%r9,1)       0usec        132,820,000
0x677a   0            vmovdquy (%r8,%r9,1), %ymm0       198.739usec  109,910,000
0x6780   0            vmovdquy 0x20(%r8,%r9,1), %ymm1   0usec        134,270,000
0x6787   0            vmovq %rdx, %xmm2                 99.370usec   128,180,000
0x678c   0            vpbroadcastq %xmm2, %ymm2         0usec        122,670,000
0x6791   0            vpsllvq %ymm4, %ymm2, %ymm2       0usec        133,980,000
0x6796   0            vpsrld $0x1a, %ymm2, %ymm2        99.370usec   100,920,000
0x679b   0            vpmovzxdq %xmm2, %ymm3            0usec        163,850,000
0x67a0   0            vpsllvq %ymm3, %ymm5, %ymm3       0usec        117,450,000
0x67a5   0            vextracti128 $0x1, %ymm2, %xmm2   0usec        130,790,000
0x67ab   0            vpmovzxdq %xmm2, %ymm2            0usec        139,200,000
0x67b0   0            vptest %ymm3, %ymm0               0usec        97,440,000
0x67b5   0            setb %r10b                        0usec        175,450,000
0x67b9   0            vpsllvq %ymm2, %ymm5, %ymm0       0usec        126,150,000
0x67be   0            vptest %ymm0, %ymm1               0usec        120,060,000
0x67c3   0            setb %r11b                        0usec        122,960,000
0x67c7   0            test %r10b, %r11b                 99.370usec   258,390,000
0x67ca   0            jz 0x6740 <Block 7>               99.370usec   0
0x67d0   0            Block 9:
0x67d0   0            add %r8, %r9                      0usec        118,610,000
0x67d3   0            add $0x40, %r9                    0usec        124,120,000
0x67d7   0            vmovdquy (%r9), %ymm0             0usec        131,660,000
0x67dc   0            mulx %r12, %rdx, %r9              0usec        131,080,000
0x67e1   0            xor %rdx, %r9                     0usec        112,230,000
0x67e4   0            vmovq %r9, %xmm1                  0usec        140,650,000
0x67e9   0            vpbroadcastq %xmm1, %xmm1         0usec        124,410,000
0x67ee   0            vpsllvq %xmm6, %xmm1, %xmm1       0usec        136,590,000
0x67f3   0            vpsrld $0x1a, %xmm1, %xmm1        0usec        109,910,000
0x67f8   0            vpmovzxdq %xmm1, %ymm1            0usec        134,850,000
0x67fd   0            vpsllvq %ymm1, %ymm7, %ymm1       0usec        118,030,000
0x6802   0            xor %edx, %edx                    0usec        103,530,000
0x6804   0            vptest %ymm1, %ymm0               0usec        168,780,000
0x6809   0            setb %dl                          0usec        96,860,000
0x680c   246          add %rdx, %rax                    0usec        143,550,000
0x680f   0            add $0x4, %rcx                    0usec        140,940,000
0x6813   0            cmp %rsi, %rcx
0x6816   246          jnz 0x6752 <Block 8>              0usec        285,360,000

If we prefetch 50% more data (K==8 prefetches only 2 cache lines, K==11 prefetches 3) and we never use it (when we are benchmarking unsuccessful lookups), performance suffers.
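For reference, here is a back-of-the-envelope sketch of the cache line arithmetic. This is my guess at how the prefetch count falls out of K, not necessarily the library's actual formula, but it reproduces the 2 vs. 3 counts above: a fast_multiblock64<K> bucket spans K*8 bytes, and since a bucket need not start on a cache line boundary, one extra line is covered.

#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr std::size_t cacheline = 64;

// Hypothetical helper, for illustration only: whole cache lines spanned by
// K 64-bit blocks, plus one for possible misalignment of the bucket.
constexpr std::size_t prefetched_cachelines(std::size_t k)
{
    return (k * sizeof(std::uint64_t) + cacheline - 1) / cacheline + 1;
}

int main()
{
    std::printf("K==8  -> %zu cache lines\n", prefetched_cachelines(8));  // prints 2
    std::printf("K==11 -> %zu cache lines\n", prefetched_cachelines(11)); // prints 3
}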
So if I just change one line (commenting out the prefetch), unsuccessful lookup performance changes drastically, presumably because 3M is in the area where we are over the L2 cache size of my CPU.

3M, with prefetch:

filter: filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB: 5.00679
FPR: 0.159833
insertion_time: 6.07411
successful_lookup_time: 5.62967
unsuccessful_lookup_time: 5.64557
mixed_lookup_time: 11.9758

Without prefetch (the one-line change):

const unsigned char* next_element(boost::uint64_t& h) const noexcept
{
  auto p = ar.buckets + hs.next_position(h) * bucket_size;
  for(std::size_t i = 0; i < prefetched_cachelines; ++i){
    // BOOST_BLOOM_PREFETCH((unsigned char*)p + i*cacheline);
  }
  return p;
}

filter: filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB: 5.00679
FPR: 0.159833
insertion_time: 6.02206
successful_lookup_time: 5.03205
unsuccessful_lookup_time: 3.58734
mixed_lookup_time: 11.3664

For 1M there is no noticeable difference between prefetch and no prefetch (as all data fits in L2). To confirm this is all about memory, I tried 35M, and here again commenting out the prefetch helps with unsuccessful lookups.

35M, with prefetch:

filter: filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB: 58.4126
FPR: 0.160926
insertion_time: 17.7846
successful_lookup_time: 15.6446
unsuccessful_lookup_time: 14.7353
mixed_lookup_time: 21.5229

35M, without prefetch:

filter: filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB: 58.4126
FPR: 0.160926
insertion_time: 18.0509
successful_lookup_time: 13.8877
unsuccessful_lookup_time: 11.4105
mixed_lookup_time: 19.807

In the end this is kind of obvious: I was so focused on codegen that I forgot obvious things about caching. Still, I do not think all these measurements were useless: I can imagine a world in which you expect most lookups to be unsuccessful, and then you disable prefetch except maybe for the initial block (which is always needed); see the sketch below.
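To make that concrete, here is a standalone sketch of a lookup front end that always prefetches the first cache line but makes the speculative ones opt-in. This is my own illustration, not existing Boost.Bloom code: the PrefetchAll policy parameter is hypothetical, and __builtin_prefetch (GCC/Clang) stands in for BOOST_BLOOM_PREFETCH.

#include <cstddef>

constexpr std::size_t cacheline = 64;

template<bool PrefetchAll>
const unsigned char* next_element(const unsigned char* buckets,
                                  std::size_t position,
                                  std::size_t bucket_size,
                                  std::size_t prefetched_cachelines) noexcept
{
    const unsigned char* p = buckets + position * bucket_size;
    __builtin_prefetch(p); // initial cache line: every lookup reads it anyway
    if constexpr(PrefetchAll){
        // speculative lines: these only pay off when the element is likely present
        for(std::size_t i = 1; i < prefetched_cachelines; ++i){
            __builtin_prefetch(p + i * cacheline);
        }
    }
    return p;
}

Whether such a knob is worth exposing depends on how well users can predict their lookup mix, but the 3M and 35M numbers above suggest the win for mostly-unsuccessful workloads can be substantial.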