Boost logo

Boost :

From: Arnaud Becheler (arnaud.becheler_at_[hidden])
Date: 2025-05-22 20:15:48


Dear Boost community,

Tomer encountered some issues with the ML and asked me to forward their
review. Since the deadline is approaching, if anyone else is interested in
reviewing, please let me know—we may be able to arrange an extension.

Best regards,
Arnaud Becheler
Review Manager, Boost.Bloom

---
Hi all,
Bottom line first:
In my opinion, the proposed Bloom library should be ACCEPTED to Boost.
It implements a useful data structure, and does it with high quality.
Joaquín, thank you for submitting this library to Boost!
Now for the details.
Background:
This is my first time reviewing a proposed Boost library. I recently
finished implementing a Blocked Bloom filter, so this library is very
interesting to me and I have some relevant knowledge. Sadly, my version of
the BF requires some unique customizations which would likely make it
difficult to use the proposed library in my code (I won’t go into details,
as they are not relevant, and I wouldn't want the library to become
encumbered by multiple customization options). I will only mention that in
my implementation, reducing memory access to a single cache line was found
to be critical for performance.
Design & documentation:
The design of the library looks solid, with good design choices. However,
I’m missing the rationale behind those choices.
One example is that k is a compile-time constant. As a user this is far
from optimal. I do not care about the number of bits used for each key,
only that it is optimal. In practice, the optimal value for k is a function
of the target FPR. In other cases the target is the memory overhead (m/n)
and k is derived from that. I can choose m, n, and the FPR at runtime, but
I have to decide on k in compile-time. I understand that there are
implementation considerations in favor of making k a compile-time constant,
I just wish to have that explained in the docs.
Another design limitation is that to my understanding the library doesn’t
actually offer an implementation of the classical Bloom filter
(contradicting the claim in the first line of the docs). The default
behavior is a variant that sets/gets bits in k different 1-byte blocks, and
there is no option for a block-less BF. This is different from the
classical Bloom filter in which 2 or more hashes can target the same byte,
and even the same bit. The FPR will be very slightly higher than the FPR
for a classical BF with the same parameters, which makes the current
calculation very slightly off. I think this should at least be addressed in
the documentation.
I wanted to use a Blocked BF with blocks of 1 cache line. It was not
obvious from the docs that this is possible, because the docs say that the
underlying type for the Block must be an unsigned integral type. I tried
the following which seemed to work:
  using CacheLine = std::bitset<512>;
  using filter = boost::bloom::filter<T, 1, boost::bloom::block<CacheLine,
K>>;
I would like to see this officially supported and documented as a main
use-case.
The purpose of multiblock is not clear. Seems to me that only the _fast
variants are useful, for their speed. Otherwise, a block with sufficient
size (e.g. using std::bitset as above, but might need something with better
implementation) would be better due to slightly lower FPR.
The documentation mentions many examples like this:
“filter<1,block<uint32_t,K>>”
However, the first template parameter is missing, and this is confusing.
The examples should all provide valid types and working code.
This appears in several places.
“For given values of n and m, the optimum k is the integer closest to
[formula]”
Mathematically speaking, this is not strictly correct. For example, with
m/n=12.2553042 you will get k_opt=8.4947, but choosing k=9 will give you
(very slightly) lower FPR. On the other hand, in this case I would still
choose k=8 because 8 iterations is less than 9. So it's mostly a matter of
wording.
“Without formal justification, we have relaxed this even further to just
one initial hash value [...]”
You might be interested in SingleHash [
https://yangzhou1997.github.io/paper/singlehash-bigcomp18.pdf] which
provides formal justification.
Implementation:
I only reviewed the code very lightly, and only tried to use the filter in
some basic code, so I don’t have much to contribute in this area.
To use multiblock I needed to include a dedicated header, which is not
needed for block. This is a bit confusing.
Estimating FPR based on array occupancy: You wrote in another thread that
you “know how to estimate the number of elements from the number of bits
set [… but the formula] is valid for the classical Bloom filter”. Since
each block is itself a classical filter, and expected value is linear, you
can apply the formula per block (using K') and then sum the total for the
whole filter, and finally divide by K. I think this is a nice feature to
have.
Summary:
As I wrote above, this is a useful high-quality library.
I don't think a single Bloom library can support all possible variations,
so it's important to document the design choices that were made.
Thank you for contributing this code to Boost.
Tomer Vromen
---

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk