|
Boost : |
Subject: Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project
From: Topher Cooper (topher_at_[hidden])
Date: 2011-06-26 13:50:23
On 6/25/2011 1:53 PM, Phil Endecott wrote:
> I believe I have convinced myself that a bitwise AND on bloom filters
> gives something that is a superset of the intersection of the
> contents, i.e. the false positive rate is increased. A proof would be
> good.
Easily done:
Take two Bloom filters, A and B, with the same hash-functions (h of
them) and the same size vectors (N). If the number of bits set in one
of them is s, then when h<<N and s>>1 the probability of a false
positive is approximately (h/N)^s, so the false positive rate increases
with the proportion of the bits that are set.
If an element appears in both sets, then the bits corresponding to any
element that appears in both sets will be set in both vectors and will
then be set in the bit-wise conjunction. That same element will
therefore appear as a positive in the conjunction so the conjunction at
least includes the intersection.
For any bit that is not among the bits that would be set in the properly
constructed Bloom filter for (A intersect B) there is some probability
that it is set by one of the elements in A's set that is not in B's
/and/ by one of the elements in B's set that is not in A's, so it will
appear in the conjunction bits. This means that the "s" for the
conjunction will be larger than the s for the intersection so the false
positive rate will be higher for the conjunction than for the intersection.
The probability is fairly high for any particular bit not in the
intersection to be set in the conjunction if the intersection is small
relative to the number of elements in A and the number of elements of
B. If the hash functions are each applied to the entire N bits, then
there will be h*|A-B| opportunities for a bit to be set from an element
in set A that is not in B and h*|B-A| opportunities for it to be set
from an element in set B that is not in A. If each hash is applied to a
distinct subset of oN/h bits from the total, then the probability is
smaller but still not insignificant: |A-B| opportunities for elements
in A but not in B times |B-A| opportunities for elements in B but not A.
Topher Cooper
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk