
From: Joaquin M López Muñoz (joaquinlopezmunoz_at_[hidden])
Date: 2025-05-20 15:10:09


On 18/05/2025 at 23:38, Ivan Matek wrote:
> [...]
> Few more questions I thought of recently, apologies in advance if I
> misunderstood something.
>
> 1. There is some SIMD code that seems to have a cost in K that is
>    proportional to the "greater or equal multiple of 8". I wonder whether
>    that affects the recommendation for the optimal K or not? E.g. if the
>    table we discussed before tells me to use K 14, should I bump it up to
>    16 if using fast_multiblock64?
>  static BOOST_FORCEINLINE bool check(const value_type& x,boost::uint64_t hash)
>  {
>    for(int i=0;i<k/8;++i){
>      if(!check_m256ix2(x[i],hash,8))return false;
>      hash=detail::mulx64(hash);
>    }
>    if(k%8){
>      if(!check_m256ix2(x[k/8],hash,k%8))return false;
>    }
>    return true;
>  }
>

This k%8 check doesn't really happen at run time, as k is a compile-time
value, so k=14 should be (slightly) faster than k=16. So, I don't
particularly recommend that a multiple of 8 be used if that's not optimal
in terms of achieved FPR. That said, you can see in

https://github.com/joaquintides/boost_bloom_benchmarks

that execution times for fast_multiblockXX<K>, although roughly proportional
to K, seem to occasionally favor k%8==0. So, in the end I'm afraid you'll
have to measure to come up with the best configuration for your environment.
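Just for reference, a minimal harness for that kind of measurement could look
as follows; the header paths and the capacity-in-bits constructor are my
assumptions about the candidate library, so adjust to the actual API (and
build with -mavx2 so the fast_ path is actually taken):

  #include <boost/bloom/filter.hpp>            // assumed header path
  #include <boost/bloom/fast_multiblock64.hpp> // assumed header path
  #include <chrono>
  #include <cstdint>
  #include <cstdio>
  #include <random>
  #include <vector>

  template<typename Filter>
  double lookup_ns_per_op(std::size_t n, std::size_t capacity_bits)
  {
    Filter f(capacity_bits);                 // capacity in bits (assumption)
    std::mt19937_64 rng(42);
    std::vector<std::uint64_t> keys(n);
    for(auto& k: keys){ k = rng(); f.insert(k); }

    auto t0 = std::chrono::steady_clock::now();
    std::size_t hits = 0;
    for(auto k: keys) hits += f.may_contain(k); // all positive lookups
    auto t1 = std::chrono::steady_clock::now();

    volatile std::size_t sink = hits;        // keep the loop from being elided
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
  }

  int main()
  {
    using f14 = boost::bloom::filter<
      std::uint64_t, 1, boost::bloom::fast_multiblock64<14>, 1>;
    using f16 = boost::bloom::filter<
      std::uint64_t, 1, boost::bloom::fast_multiblock64<16>, 1>;
    std::printf("k=14: %.2f ns/lookup\n", lookup_ns_per_op<f14>(1'000'000, 20'000'000));
    std::printf("k=16: %.2f ns/lookup\n", lookup_ns_per_op<f16>(1'000'000, 20'000'000));
  }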

> 2. The fact that a bloom filter has a break in lookup (you can break as
>    soon as one bit is not set; let's ignore SIMD for now) makes me wonder
>    if performance depends on the hit rate? E.g. if I know my hit rate will
>    be ~10% or ~90%, should that affect my picking of K?
>

block and multiblock do not do early termination on lookup, while
fast_multiblockXX do. The reason is that this is what proved fastest
overall in each case.

As for the hit rate, well, I guess if you know most lookups
are going to be negative, then you can maybe use a higher k.
But then, for a given capacity and number of inserted
elements, there's no point in increasing k beyond its optimal
value (that yielding the smallest possible FPR).
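To quantify that last point: for a classical (non-blocked) Bloom filter the
optimum is k ~= (m/n)*ln 2 and FPR ~= (1 - e^(-kn/m))^k, so going past the
optimum actively hurts. A quick sanity check (classical formulas only, not
the exact FPR of the blocked subfilters in the candidate library):

  #include <cmath>
  #include <cstdio>

  int main()
  {
    double m = 8.0 * 1024 * 1024; // capacity in bits
    double n = 1'000'000;         // number of inserted elements
    std::printf("optimal k ~= %.2f\n", (m / n) * std::log(2.0));
    for(int k = 2; k <= 16; k += 2){
      double fpr = std::pow(1.0 - std::exp(-k * n / m), k);
      std::printf("k=%2d  FPR ~= %.3g\n", k, fpr);
    }
  }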

> 3. I have just noticed I wrongly remembered that g++ defaults to
>    march=native; it does not. Why is this related to this library?
>    Well, I am not sure the fallback in the fast_ headers should exist. In
>    other words, when I want to use fast_ and AVX2 is not activated I
>    would prefer the code to fail to compile. So I would remove this
>    fallback:
>    template<std::size_t K>
>    using fast_multiblock64=multiblock<boost::uint64_t,K>;
>    Now there is a reasonable argument that people do not want their
>    code to break when they change compiler flags or target a new
>    architecture (e.g. one you did not add SIMD for), but I believe the
>    SIMD speedup is so huge that this should not be a silent change. If
>    users really want the fallback behavior they can do their own
>    preprocessor + using logic.
>

That's a matter of opinion, I guess, but I'd rather have people not
wanting the fallback write the compile-time check instead of the
other way around. Sometimes you're not writing a final application
but a library (say, on top of candidate Boost.Bloom), and you don't
control compilation flags or target architecture.
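For users who do want a hard failure, the opt-out check is tiny (a sketch;
__AVX2__ is the macro GCC/Clang define under -mavx2 and MSVC under
/arch:AVX2):

  // Fail the build instead of silently taking the scalar multiblock fallback.
  #if !defined(__AVX2__)
  #  error "AVX2 required here: compile with -mavx2 (GCC/Clang) or /arch:AVX2 (MSVC)"
  #endif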

> 4. Another thing related to SIMD is the question of the ones arrays used
>    in make_m256ix2/make_m256i. Initially I assumed they are not static
>    const (guaranteed construction just once, but with thread-safe init
>    overhead) because the compiler is always smart enough to optimize this
>    into some read-only memory that is initialized at compile time, and
>    everything is great. I looked into the asm of a simple test program I
>    wrote on my machine; initialization is done at compile time, great.
>    But :) then I went to godbolt and made a small reproduction of the
>    code, and there I can not <https://godbolt.org/z/sYfc7rffa> get the
>    compilers to avoid populating that array at runtime. Now I do not
>    expect you to go over the godbolt link and try to nudge the compilers
>    to optimize this by changing flags or example code; my question is more
>    about whether, if this is fragile, that is maybe a reason to use some
>    ugly trick to initialize those values at compile time (e.g. as a
>    constexpr array of properly aligned uint32 that has the correct values
>    when interpreted as ones). If anybody reading this has any idea why
>    this happens, please do tell. My first guess is that I messed up when
>    porting the code to the small godbolt example. :) My second guess would
>    be that in the bloom code the part I extracted for godbolt is used in a
>    way that hints to the compiler to do the initialization at compile time
>    (e.g. the kp values are known at compile time), but this is just a
>    guess. So, to be clear, there is no problem I can detect when using the
>    bloom library; I am just worried that with the huge number of
>    combinations of compilers, standards, optimization levels (e.g. O2 vs
>    O3) and architectures there may be a subset of users that will hit a
>    performance issue if bloom does not force compile-time init of those
>    values.
>
>
> To be more specific, for this source:
>
> int main()
> {
>   using simd_filter = boost::bloom::filter<std::string_view, 1,
>                                            boost::bloom::fast_multiblock64<2>, 1>;
>   simd_filter sf;
>   char str[2] = {char('a'+(rand()%26)), '\0'};
>   sf.insert(str);
>   str[0] = char('a'+(rand()%26));
>   return sf.may_contain(str);
> }
> On my machine clang generates beautiful asm, in the sense that there is no
> initialization of ones at runtime. When I copy the godbolt example to my
> machine, the runtime initialization of ones is there, so it is probably
> not just a godbolt/my-machine difference.
>

Your godbolt snippet behaves differently from candidate Boost.Bloom
because the values of kp are not known at compile time (whereas they
are in candidate Boost.Bloom). If you instead do

  std::array vals = {5,6,7};

https://godbolt.org/z/M7W5Tnbsq

then the code becomes noticeably simpler, and in one compiler (Clang)
the ones table goes away completely. In general, having the table
marked as static or not makes no difference; compilers are smart
enough to handle that regardless. Also, I fail to see any run-time table
initialization in your original snippet at https://godbolt.org/z/sYfc7rffa.
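As for the "ugly trick" you mention, a rough sketch of what it could look
like is below; the names and lane values are illustrative only, not the
library's actual constants:

  #include <immintrin.h>
  #include <cstdint>

  // Hypothetical compile-time table: a one in each 64-bit lane, aligned for
  // an aligned AVX load. Being constexpr, it is emitted as read-only data
  // and cannot be populated at run time, whatever the optimizer does.
  alignas(32) inline constexpr std::uint64_t ones_lanes[4] = {1, 1, 1, 1};

  inline __m256i load_ones()
  {
    // _mm256_load_si256 requires 32-byte alignment, guaranteed by alignas.
    return _mm256_load_si256(reinterpret_cast<const __m256i*>(ones_lanes));
  }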

Joaquin M Lopez Munoz

