
From: Kostas Savvidis (kotika98_at_[hidden])
Date: 2025-05-18 00:18:26


Hi All,

This is a mini review of the proposed Bloom library.
I have spent a few hours researching the background and a couple of hours
with the example programs, modifying them in various ways,
and did not encounter any showstopper problems.
I am knowledgeable in probability, random numbers, and related mathematical areas.

Cheers,
Kostas

=========================================
Institute of Nuclear and Particle Physics
NCSR Demokritos
http://inspirehep.net/author/profile/K.G.Savvidy.1
https://mixmax.hepforge.org

====================================================================
CONSIDERATIONS:
This library implements the classic Bloom filter, which provides an O(1)-time test
of whether a given object belongs to a previously preprocessed set.
The preprocessing takes O(n) time and, importantly, O(n) storage for a bit array, but the original data can optionally be discarded.
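For orientation, a minimal usage sketch along the lines of the shipped examples (my own code; the header path, the filter<T, K> signature, and the constructor taking the array size in bits are as I understood them from the docs, so treat the details as assumptions):

#include <boost/bloom/filter.hpp>
#include <iostream>
#include <string>

int main()
{
    // 5 hash probes per element, bit array of 1'000'000 bits.
    boost::bloom::filter<std::string, 5> f(1'000'000);

    f.insert("ACGTACGTACGTACGTACGT");

    // may_contain() never gives a false negative; a "yes" for an item
    // that was never inserted is a false positive.
    std::cout << f.may_contain("ACGTACGTACGTACGTACGT") << '\n'; // 1
    std::cout << f.may_contain("TTTTTTTTTTTTTTTTTTTT") << '\n'; // almost surely 0

    // A priori FPR estimate after n insertions into this capacity;
    // this is the function whose accuracy I question below.
    std::cout << decltype(f)::fpr_for(100'000, f.capacity()) << '\n';
}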
   I was reminded of the storage requirement while experimenting with the provided
example for genomic data: the example datafile is 140MB, while the resulting filter
contains a lookup table taking up 180MB (!) at the 0.01 setting for the target false positive rate (FPR).
Further, I compared the "predicted" false positive rate obtained from the library function
genome_filter::fpr_for(n, f.capacity()) with the actual FPR.
This prediction appears to be overly pessimistic, as experiments with the real data showed:
when I set the filter FPR to 0.5 or even 0.9 (which results in a much smaller lookup table),
there were almost no false positives in practice.
    I do not fully understand the reasons for this, but one of them appears to be that
THE "SET" WE INSERTED HAD MANY IDENTICAL ITEMS.
In this case the stream had many identical k-mers: to be precise, out of 142 million base pairs,
~22 million of the 20-mers are duplicates, yet even when the FPR is set to 0.9 only 40 million extra false collisions are generated,
which is far less than 90%. Further investigation suggests that fpr_for is far from reality:
the bit array for FPR=0.9 is 2^28 = 268m bits wide, of which only 142m - 22m = 120m can be occupied, thus
the correct FPR is about 120/268 = 44%, in rough agreement with the observed 40m/120m.
    Similarly, when I request fpr=0.01, i.e. 1%, the number of false collisions is only 86k, so the actual FPR is probably
closer to 2*86k/120m = 0.14%.
    This necessitates my recommendation 2) below.
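For reference, the standard calculation behind these numbers (where n must count DISTINCT items, which is precisely the pitfall above):

fraction of set bits:  p = 1 - (1 - 1/m)^(k*n)  ~  1 - exp(-k*n/m)
false positive rate:   FPR ~ p^k

Taking k = 1 probe per item (as the occupancy count above implies), the distinct count n = 120m and m = 268m bits give p ~ 1 - exp(-120/268) ~ 0.36, close to the observed 40m/120m ~ 0.33, whereas counting the 22m duplicates in n inflates the estimate.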

EASE OF USE:
Appears to be reasonable. Some amount of code to set up the filter needs to be written by hand.
In particular, if the items are long strings or some other nontrivial datatype, one needs to
write a function called "hash_value" which maps the item to a std::size_t, as sketched below.
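A minimal sketch of that step for a hypothetical user type (the type and its fields are mine, not the library's; boost::hash, the filter's default hasher, picks up hash_value via argument-dependent lookup):

#include <boost/container_hash/hash.hpp>
#include <cstddef>
#include <string>

namespace genomics {

struct kmer {                 // hypothetical nontrivial item type
    std::string bases;
    int offset;
};

// Found by ADL; maps the item to a std::size_t as the filter requires.
std::size_t hash_value(const kmer& k)
{
    std::size_t seed = 0;
    boost::hash_combine(seed, k.bases);
    boost::hash_combine(seed, k.offset);
    return seed;
}

} // namespace genomics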

DOCUMENTATION
Clear writing, background is discussed, examples are good, and references to relevant papers are provided.

DESIGN:
Is solid to my eyes, with the exception of a hand-rolled random number generator "mcg_and_fastrange" in boost/bloom/detail/core.hpp.
Maybe this is good enough, but it is not very good. It attempts two things at once:
generating a position in the bit array at which to set a bit, and "expanding" a hash. However, when the bucket size is a power of two, say m = 2^28,
the multiplier will be chosen as 2^28 + 3; needless to say, this means the lower 28 bits will behave badly, as if the hash had been produced with
h' = h*3 mod 2^64. Lemire's technique of producing a random integer in the range 0..m is solid, but it cannot be combined with producing the random numbers themselves. Hence my recommendation 3).
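A sketch of the fix I have in mind (my own code, not the library's; the high(h*m) step needs a 64x64->128 multiply, done here via the GCC/Clang __int128 extension):

#include <cstdint>

// Decouple the two jobs: first advance the hash with a full-width MCG step
// using Knuth's multiplier, then map it to [0, m) with Lemire's
// multiply-and-shift, which only consumes the high bits.
inline std::uint64_t next_position(std::uint64_t& h, std::uint64_t m)
{
    h *= 6364136223846793005ull;  // h' = h*a mod 2^64
    return static_cast<std::uint64_t>(
        (static_cast<unsigned __int128>(h) * m) >> 64);  // high(h*m), in [0, m)
}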

CONCLUSION: The popularity of Bloom filters is possibly due to the actual performance (false positive rate) being better in practice than estimated a priori.
This is not free, nor is it magic: it comes at the cost of large memory storage, 1.44*n*log2(1/ε) bits in total, i.e. 1.44*log2(1/ε) bits per item (roughly 10 bits at ε = 0.01).
Nevertheless, it is definitely a useful general-purpose algorithm.

RECOMMENDATIONS:
1) Accept the library, with conditions:
2) Provide a second, more accurate way to compute the FPR, based not on the a priori estimate from the number of elements already inserted
    but on the actual number of bits set in the bit array. This quantity could be computed at small cost during insertion by keeping track of collisions,
    or evaluated after the fact by a full scan of the bit array (total Hamming weight); see the sketch after this list.
3) For the rng, use h' = h*a mod 2^64 with a = 6364136223846793005, the known good multiplier from Knuth, then high(h'*m) for the random position, as sketched in the DESIGN section above.
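A sketch of what 2) could look like (my own code; it assumes the library exposes its bit array as raw words, which it currently does not, and exposing it is part of the suggestion):

#include <bit>
#include <cmath>
#include <cstdint>
#include <vector>

// Estimate the FPR from the observed load of the bit array: a query on an
// item never inserted is a false positive iff all k probed bits are set.
double empirical_fpr(const std::vector<std::uint64_t>& words, int k)
{
    std::size_t ones = 0;
    for (std::uint64_t w : words) ones += std::popcount(w);  // total Hamming weight
    double load = double(ones) / (64.0 * words.size());      // fraction of set bits
    return std::pow(load, k);
}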

