Boost :

arnaud.becheler_at_[hidden]

---
Hi all,
Bottom line first:
In my opinion, the proposed Bloom library should be ACCEPTED to Boost.
It implements a useful data structure, and does it with high quality.
JoaquÃn, thank you for submitting this library to Boost!
Now for the details.
Background:
This is my first time reviewing a proposed Boost library. I recently
finished implementing a Blocked Bloom filter, so this library is very
interesting to me and I have some relevant knowledge. Sadly, my version of
the BF requires some unique customizations which would likely make it
difficult to use the proposed library in my code (I wonâ€™t go into details,
as they are not relevant, and I wouldn't want the library to become
encumbered by multiple customization options). I will only mention that in
my implementation, reducing memory access to a single cache line was found
to be critical for performance.
Design & documentation:
The design of the library looks solid, with good design choices. However,
Iâ€™m missing the rationale behind those choices.
One example is that k is a compile-time constant. As a user this is far
from optimal. I do not care about the number of bits used for each key,
only that it is optimal. In practice, the optimal value for k is a function
of the target FPR. In other cases the target is the memory overhead (m/n)
and k is derived from that. I can choose m, n, and the FPR at runtime, but
I have to decide on k in compile-time. I understand that there are
implementation considerations in favor of making k a compile-time constant,
I just wish to have that explained in the docs.
Another design limitation is that to my understanding the library doesnâ€™t
actually offer an implementation of the classical Bloom filter
(contradicting the claim in the first line of the docs). The default
behavior is a variant that sets/gets bits in k different 1-byte blocks, and
there is no option for a block-less BF. This is different from the
classical Bloom filter in which 2 or more hashes can target the same byte,
and even the same bit. The FPR will be very slightly higher than the FPR
for a classical BF with the same parameters, which makes the current
calculation very slightly off. I think this should at least be addressed in
the documentation.
I wanted to use a Blocked BF with blocks of 1 cache line. It was not
obvious from the docs that this is possible, because the docs say that the
underlying type for the Block must be an unsigned integral type. I tried
the following which seemed to work:
  using CacheLine = std::bitset<512>;
  using filter = boost::bloom::filter<T, 1, boost::bloom::block<CacheLine,
K>>;
I would like to see this officially supported and documented as a main
use-case.
The purpose of multiblock is not clear. Seems to me that only the _fast
variants are useful, for their speed. Otherwise, a block with sufficient
size (e.g. using std::bitset as above, but might need something with better
implementation) would be better due to slightly lower FPR.
The documentation mentions many examples like this:
â€œfilter<1,block<uint32_t,K>>â€
However, the first template parameter is missing, and this is confusing.
The examples should all provide valid types and working code.
This appears in several places.
â€œFor given values of n and m, the optimum k is the integer closest to
[formula]â€
Mathematically speaking, this is not strictly correct. For example, with
m/n=12.2553042 you will get k_opt=8.4947, but choosing k=9 will give you
(very slightly) lower FPR. On the other hand, in this case I would still
choose k=8 because 8 iterations is less than 9. So it's mostly a matter of
wording.
â€œWithout formal justification, we have relaxed this even further to just
one initial hash value [...]â€
You might be interested in SingleHash [
https://yangzhou1997.github.io/paper/singlehash-bigcomp18.pdf] which
provides formal justification.
Implementation:
I only reviewed the code very lightly, and only tried to use the filter in
some basic code, so I donâ€™t have much to contribute in this area.
To use multiblock I needed to include a dedicated header, which is not
needed for block. This is a bit confusing.
Estimating FPR based on array occupancy: You wrote in another thread that
you â€œknow how to estimate the number of elements from the number of bits
set [â€¦ but the formula] is valid for the classical Bloom filterâ€. Since
each block is itself a classical filter, and expected value is linear, you
can apply the formula per block (using K') and then sum the total for the
whole filter, and finally divide by K. I think this is a nice feature to
have.
Summary:
As I wrote above, this is a useful high-quality library.
I don't think a single Bloom library can support all possible variations,
so it's important to document the design choices that were made.
Thank you for contributing this code to Boost.
Tomer Vromen
---