Subject: Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project
From: Alejandro Cabrera (cpp.cabrera_at_[hidden])
Date: 2011-06-21 13:35:10
> On 6/21/2011 12:21 AM, Alejandro Cabrera wrote:
>> Hello all,
>> I'm writing to request feedback, suggestions, and advice from the Boost
>> community. I've been working on a Bloom filter data structure for about
>> a month and a half now. At this point, the following is complete and
>> available in the Boost Sandbox:
>> - working, basic prototype (insert& query operations, union& intersect)
>> - a test suite covering the public interface
>> - five examples
>> - documentation covering many aspects of the project (tested on Chrome
>> 12 and 13, Firefox 4 and 6)
> I would suggest to you that you briefly explain what a Bloom filter
> entails, and why it might be important, when sending your message to
> this list.
Thank you for the suggestion. The data structure is introduced below.
Briefly, a Bloom filter is a probabilistic data structure that
represents a set of elements compactly. Two operations are supported in
the most basic variant: insert, contains. An element is inserted by
computing k hash values and setting the bits at the locations given by
the hash values to 1. Checking whether an element is contained in the
Bloom filter is achieved by hashing the element and checking that the
bits at those locations are all 1. A Bloom filter may incorrectly tell
you that an element is in the set (false positive). If false positives
are acceptable, Bloom filters are a powerful data structure to avoid
expensive look ups.
There exist variants that also support element removal at the cost of an
increase in storage requirements and the danger of false negatives -
these are not covered by Boost.BloomFilter just yet.
Bloom filters have seen use in file systems (avoiding duplicate writes,
deduplication), network caches (determining whether to attempt to fetch
from cache), databases (does a row/column exist on disk?) and hardware
design. Where the primary interest is to compactly represent a
potentially very large universe of elements to mitigate latency
differences, Bloom filters are a great choice.
Further details can be found in the project documentation files.