Boost logo

Boost :

From: Ivan Matek (libbooze_at_[hidden])
Date: 2025-05-17 18:36:54


Hi everybody,

I was reading the documentation/code, and I have some questions/comments.

   1. Why is unit of measurement bit? I understand that conceptually bloom
   filters(maybe others, did not check) are bit arrays, but for me this seems
   like very error prone. Would verbose alternative of using
   boost::bloom::bits(1000) or boost::bloom::bytes(125) be preferred? I did
   not suggest UDL, as I feel for such rarely used class it may be overkill.
   2. boost::hash<short> (and other integral ones) afaik are identity, I
   presume this is not an issue because you do extra steps if
   hash_is_avalanching is false, but maybe users should prefer avalanching
   hash? If so maybe suggest fast avalanching hash available in boost... btw
   I tried to read the docs but I got(not a big issue, since I read them from
   original. version): 404 Not Found
      - Code: NoSuchKey
      - Message: The specified key does not exist.
      - Key:
      develop.bloom.cpp.al/unordered/doc/html/unordered/reference/hash_traits.html
      - RequestId: JHRNESDX8BYY6T2G
      - HostId:
      nz9kkYamNQ4ocu023JneA7PyMHTEyQK9jgMSSBuwReE0OQAslcN0xOZxgopXpvn/63AFQzreXgM=
   3. libs/unordered/include/boost/unordered/detail/mulx.hpp contains mulx
   implementation that seems very similar to one used in bloom, and authors
   are same. Although this is tiny duplication, would it be best to extract it
   somewhere to avoid duplication?
   4. Is there no benefit in having a power of 2 sized bloom filters? AFAIK
   it is trick used by hash maps (requires hash is good) to get faster
   indexing. I see you use clever 128bit multiplication that generates fast
   mul instruction to avoid division, but I still presume shift and masking
   would be faster.
   5. Why is BOOST_ASSERT(fpr>=0.0&&fpr<=1.0); not
BOOST_ASSERT(fpr>0.0&&fpr<=1.0);
   , i.e. is there benefit of allowing calls with impossible fpr argument?
   This may sound like nitpicking, but I think it is wrong to allow calls with
   arguments that give people wrong idea of what bloom filters are capable. I
   understand that this functionality is under capacity *esimation**,*but I
   still think it may be confusing.
   6. Related to 6. : It is relatively easy to OOM with this if you are
   "reckless" with fpr value, I would add a suggestion to docs to mention
   this(if it is not already there, maybe I missed it).
   7. It may be just me, but I believe Filter Definition chapter would
   benefit from visual example, in particular how is BucketSize related to
   subarray size.
   8. Regarding choosing optimal value: In the example we use 10M
   elements. From what I see they are unused in our computations till the end
   so I would rewrite: "*Suppose we plan to insert 10M elements and want
   to keep the FPR at 10−4. The chart gives us five possibilities:*" to
   something like
   " Suppose we plan to insert 10M elements and want to keep the FPR at
   10−4. Looking up FPR in the chart(this step is not dependent on number of
   elements) gives us five possibilities:"
   9. My IDE complains about redundant inline on member functions. Is this
   intentionally done since (afaik msvc) some compilers use inline as
   optimization(inlining) hint or?
   10. Should #if 1 in may_contain be there?


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk