Boost logo

Boost :

From: Joaquin M López Muñoz (joaquinlopezmunoz_at_[hidden])
Date: 2025-05-18 09:55:40


El 17/05/2025 a las 20:36, Ivan Matek via Boost escribió:
> Hi everybody,
>
> I was reading the documentation/code, and I have some questions/comments.
>
>
> 1. Why is unit of measurement bit? I understand that conceptually bloom
> filters(maybe others, did not check) are bit arrays, but for me this seems
> like very error prone. Would verbose alternative of using
> boost::bloom::bits(1000) or boost::bloom::bytes(125) be preferred? I did
> not suggest UDL, as I feel for such rarely used class it may be overkill.

I'm using bits because that's the unit used universally on the
literature on Bloom filters.
But yes, this bit/byte confusion may lead to user errors (I've been
bitten myself by
it occasionally during development). As for libraries out there, some
use bits
(to name a few, Apache Common Collections, Arash Partow's, fastbloom)
and some
use bytes (Impala).

I'm happy to do whatever the communitty decides is best here.

> 2. boost::hash<short> (and other integral ones) afaik are identity, I
> presume this is not an issue because you do extra steps if
> hash_is_avalanching is false,
Correct.
> but maybe users should prefer avalanching
> hash? If so maybe suggest fast avalanching hash available in boost...
What's that fast avalanching hash you refer to?
> btw
> I tried to read the docs but I got(not a big issue, since I read them from
> original. version): 404 Not Found
> - Code: NoSuchKey
> - Message: The specified key does not exist.
> - Key:
> develop.bloom.cpp.al/unordered/doc/html/unordered/reference/hash_traits.html
> - RequestId: JHRNESDX8BYY6T2G
> - HostId:
> nz9kkYamNQ4ocu023JneA7PyMHTEyQK9jgMSSBuwReE0OQAslcN0xOZxgopXpvn/63AFQzreXgM=
Yes, this is because the documentation is written as if the library were
already part of
Boost, while this is not the case links to other parts of Boost will
fail. Arnaud comments
on this issue on his review announcement post.Thanks for the heads up
anyway.
> 3. libs/unordered/include/boost/unordered/detail/mulx.hpp contains mulx
> implementation that seems very similar to one used in bloom, and authors
> are same. Although this is tiny duplication, would it be best to extract it
> somewhere to avoid duplication?
The only place I can think of is Boost.Core, and I'm not sure this very
specific
code really belongs there as a public utility.
> 4. Is there no benefit in having a power of 2 sized bloom filters? AFAIK
> it is trick used by hash maps (requires hash is good) to get faster
> indexing. I see you use clever 128bit multiplication that generates fast
> mul instruction to avoid division, but I still presume shift and masking
> would be faster.
Shift and masking would be faster, but we're taking advantage of 128-bit
multiplication for indexing and further hash value production _in one fell
swoop_, so the speed up is significant vs. shift and masking plus an
additional
mixing procedure for the hash value. See

https://master.bloom.cpp.al/html/index.html#implementation_notes_bucket_location

for details.

> 5. Why is BOOST_ASSERT(fpr>=0.0&&fpr<=1.0); not
> BOOST_ASSERT(fpr>0.0&&fpr<=1.0);
> , i.e. is there benefit of allowing calls with impossible fpr argument?
fpr==0.0 is a legitimate (if uninteresting) argument value for capacity_for:

capacity_for(0, 0.0) --> 24
capacity_for(1, 0.0) --> 18446744073709549592

The formal reason why fpr==0.0 is supported is because of symmetry:
some calls to fpr_for actually return 0.0 (for instance, fpr_for(0, 100)).

> This may sound like nitpicking, but I think it is wrong to allow calls with
> arguments that give people wrong idea of what bloom filters are capable. I
> understand that this functionality is under capacity *esimation**,*but I
> still think it may be confusing.
> 6. Related to 6. : It is relatively easy to OOM with this if you are
> "reckless" with fpr value, I would add a suggestion to docs to mention
> this(if it is not already there, maybe I missed it).
Yes, I can add that warning, thanks for the suggestion.
> 7. It may be just me, but I believe Filter Definition chapter would
> benefit from visual example, in particular how is BucketSize related to
> subarray size.
Absolutely, will do that.
> 8. Regarding choosing optimal value: In the example we use 10M
> elements. From what I see they are unused in our computations till the end
> so I would rewrite: "*Suppose we plan to insert 10M elements and want
> to keep the FPR at 10−4. The chart gives us five possibilities:*" to
> something like
> " Suppose we plan to insert 10M elements and want to keep the FPR at
> 10−4. Looking up FPR in the chart(this step is not dependent on number of
> elements) gives us five possibilities:"
Yes, the catch is that the chart's x axis is not number of elements, but
rather
c = capacity / number of elements. I'll come up with a better wording to
avoid
confusions.
> 9. My IDE complains about redundant inline on member functions. Is this
> intentionally done since (afaik msvc) some compilers use inline as
> optimization(inlining) hint or?
Can you provide the full warning messages you're getting and for which
environment this is happening? BOOST_FORCEINLINE is used quite liberally to
nudge Visual Studio, which is otherwise notoriously lazy at inlining.
> 10. Should #if 1 in may_contain be there?

That's a leftover from the microoptimization phase during lib
development. Harmless as it is, but I will remove it eventually.

Joaquin M Lopez Munoz


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk