![]() |
Boost : |
From: Ivan Matek (libbooze_at_[hidden])
Date: 2025-05-17 18:36:54
Hi everybody,
I was reading the documentation/code, and I have some questions/comments.
1. Why is unit of measurement bit? I understand that conceptually bloom
filters(maybe others, did not check) are bit arrays, but for me this seems
like very error prone. Would verbose alternative of using
boost::bloom::bits(1000) or boost::bloom::bytes(125) be preferred? I did
not suggest UDL, as I feel for such rarely used class it may be overkill.
2. boost::hash<short> (and other integral ones) afaik are identity, I
presume this is not an issue because you do extra steps if
hash_is_avalanching is false, but maybe users should prefer avalanching
hash? If so maybe suggest fast avalanching hash available in boost... btw
I tried to read the docs but I got(not a big issue, since I read them from
original. version): 404 Not Found
- Code: NoSuchKey
- Message: The specified key does not exist.
- Key:
develop.bloom.cpp.al/unordered/doc/html/unordered/reference/hash_traits.html
- RequestId: JHRNESDX8BYY6T2G
- HostId:
nz9kkYamNQ4ocu023JneA7PyMHTEyQK9jgMSSBuwReE0OQAslcN0xOZxgopXpvn/63AFQzreXgM=
3. libs/unordered/include/boost/unordered/detail/mulx.hpp contains mulx
implementation that seems very similar to one used in bloom, and authors
are same. Although this is tiny duplication, would it be best to extract it
somewhere to avoid duplication?
4. Is there no benefit in having a power of 2 sized bloom filters? AFAIK
it is trick used by hash maps (requires hash is good) to get faster
indexing. I see you use clever 128bit multiplication that generates fast
mul instruction to avoid division, but I still presume shift and masking
would be faster.
5. Why is BOOST_ASSERT(fpr>=0.0&&fpr<=1.0); not
BOOST_ASSERT(fpr>0.0&&fpr<=1.0);
, i.e. is there benefit of allowing calls with impossible fpr argument?
This may sound like nitpicking, but I think it is wrong to allow calls with
arguments that give people wrong idea of what bloom filters are capable. I
understand that this functionality is under capacity *esimation**,*but I
still think it may be confusing.
6. Related to 6. : It is relatively easy to OOM with this if you are
"reckless" with fpr value, I would add a suggestion to docs to mention
this(if it is not already there, maybe I missed it).
7. It may be just me, but I believe Filter Definition chapter would
benefit from visual example, in particular how is BucketSize related to
subarray size.
8. Regarding choosing optimal value: In the example we use 10M
elements. From what I see they are unused in our computations till the end
so I would rewrite: "*Suppose we plan to insert 10M elements and want
to keep the FPR at 10â4. The chart gives us five possibilities:*" to
something like
" Suppose we plan to insert 10M elements and want to keep the FPR at
10â4. Looking up FPR in the chart(this step is not dependent on number of
elements) gives us five possibilities:"
9. My IDE complains about redundant inline on member functions. Is this
intentionally done since (afaik msvc) some compilers use inline as
optimization(inlining) hint or?
10. Should #if 1 in may_contain be there?
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk