Boost logo

Boost :

From: Vromen, Tomer (Tomer.Vromen_at_[hidden])
Date: 2025-05-22 15:35:04


Hi all,

Bottom line first:
In my opinion, the proposed Bloom library should be ACCEPTED to Boost.
It implements a useful data structure, and does it with high quality.
Joaquín, thank you for submitting this library to Boost!

Now for the details.

Background:

This is my first time reviewing a proposed Boost library. I recently finished implementing a Blocked Bloom filter, so this library is very interesting to me and I have some relevant knowledge. Sadly, my version of the BF requires some unique customizations which would likely make it difficult to use the proposed library in my code (I won’t go into details, as they are not relevant, and I wouldn't want the library to become encumbered by multiple customization options). I will only mention that in my implementation, reducing memory access to a single cache line was found to be critical for performance.

Design & documentation:

The design of the library looks solid, with good design choices. However, I’m missing the rationale behind those choices.

One example is that k is a compile-time constant. As a user this is far from optimal. I do not care about the number of bits used for each key, only that it is optimal. In practice, the optimal value for k is a function of the target FPR. In other cases the target is the memory overhead (m/n) and k is derived from that. I can choose m, n, and the FPR at runtime, but I have to decide on k in compile-time. I understand that there are implementation considerations in favor of making k a compile-time constant, I just wish to have that explained in the docs.

Another design limitation is that to my understanding the library doesn’t actually offer an implementation of the classical Bloom filter (contradicting the claim in the first line of the docs). The default behavior is a variant that sets/gets bits in k different 1-byte blocks, and there is no option for a block-less BF. This is different from the classical Bloom filter in which 2 or more hashes can target the same byte, and even the same bit. The FPR will be very slightly higher than the FPR for a classical BF with the same parameters, which makes the current calculation very slightly off. I think this should at least be addressed in the documentation.

I wanted to use a Blocked BF with blocks of 1 cache line. It was not obvious from the docs that this is possible, because the docs say that the underlying type for the Block must be an unsigned integral type. I tried the following which seemed to work:

  using CacheLine = std::bitset<512>;
  using filter = boost::bloom::filter<T, 1, boost::bloom::block<CacheLine, K>>;

I would like to see this officially supported and documented as a main use-case.

The purpose of multiblock is not clear. Seems to me that only the _fast variants are useful, for their speed. Otherwise, a block with sufficient size (e.g. using std::bitset as above, but might need something with better implementation) would be better due to slightly lower FPR.

The documentation mentions many examples like this: “filter<1,block<uint32_t,K>>”
However, the first template parameter is missing, and this is confusing. The examples should all provide valid types and working code.
This appears in several places.

“For given values of n and m, the optimum k is the integer closest to [formula]”
Mathematically speaking, this is not strictly correct. For example, with m/n=12.2553042 you will get k_opt=8.4947, but choosing k=9 will give you (very slightly) lower FPR. On the other hand, in this case I would still choose k=8 because 8 iterations is less than 9. So it's mostly a matter of wording.

“Without formal justification, we have relaxed this even further to just one initial hash value [...]”
You might be interested in SingleHash [https://yangzhou1997.github.io/paper/singlehash-bigcomp18.pdf] which provides formal justification.

Implementation:

I only reviewed the code very lightly, and only tried to use the filter in some basic code, so I don’t have much to contribute in this area.

To use multiblock I needed to include a dedicated header, which is not needed for block. This is a bit confusing.

Estimating FPR based on array occupancy: You wrote in another thread that you “know how to estimate the number of elements from the number of bits set [… but the formula] is valid for the classical Bloom filter”. Since each block is itself a classical filter, and expected value is linear, you can apply the formula per block (using K') and then sum the total for the whole filter, and finally divide by K. I think this is a nice feature to have.

Summary:

As I wrote above, this is a useful high-quality library.
I don't think a single Bloom library can support all possible variations, so it's important to document the design choices that were made.

Thank you for contributing this code to Boost.

Tomer Vromen


Internal Use - Confidential


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk