Subject: [Boost-commit] svn:boost r73341 - sandbox/bloom_filter/trunk/libs/bloom_filter/doc/html
Date: 2011-07-24 19:42:08
Date: 2011-07-24 19:42:07 EDT (Sun, 24 Jul 2011)
New Revision: 73341
The design page has been updated to include better descriptions of the current state of Bloom filter package. Comparisons between static and dynamic variants are also provided.
Text files modified:
sandbox/bloom_filter/trunk/libs/bloom_filter/doc/html/design.html | 94 +++++++++++++++++++++++++++++++--------
1 files changed, 75 insertions(+), 19 deletions(-)
--- sandbox/bloom_filter/trunk/libs/bloom_filter/doc/html/design.html (original)
+++ sandbox/bloom_filter/trunk/libs/bloom_filter/doc/html/design.html 2011-07-24 19:42:07 EDT (Sun, 24 Jul 2011)
@@ -39,49 +39,106 @@
- <li>Basic Bloom Filter</li>
- <li>Dynamic Basic Bloom Filter</li>
- <li>Counting Bloom Filter</li>
+ <li>Basic Bloom Filters</li>
+ <li>Counting Bloom Filters</li>
<li>Default Hash Functions</li>
- This Bloom filter implementation is designed to allow compile-time determination of storage space(in bits), insertion/query type, and hash functions. The design goals were:
+ This project was initiated with a few design goals in mind. In particular,
+ these data structures should be easy to use, easy to customize, efficient,
+ and as portable as possible without damaging the usability for most users.
+ These Bloom filters should be very easy to experiment with, as well.
+ This project acknowedlges that no solution is ideal for every problem.
+ With these Bloom filters, part of the design focused on how to
+ maximize the configurability without sacrificing the safety of the
+ data structure. Upfront templating is used to a great extent
+ in order to make these goals a reality.
+ For each Bloom filter variant, a fully static version and a partially
+ dynamic version is provided. The static version is meant to be used
+ when the stack is not a limiting factor - for smaller data sets. It
+ is meant to be slightly faster than the dynamic variant. The dynamic
+ variants requests its storage from the heap, and as a result,
+ can be as large as main memory and not incur any performance
+ <a name="basic_bloom"></a>
+ <h3>Basic Bloom Filters</h3>
+ The basic Bloom filter implementation is designed to allow compile-time
+ determination of storage space(in bits), insertion/query type, and
+ hash functions. The design goals were:
<li>Easy to use correctly</li>
- <li>Not difficult to customize regarding use of hash functions</li>
+ <li>Easy to use multiple, different hash functions</li>
<li>Compact - the amount of space used by the Bloom filter is capped by the size of the underlying std::bitset</li>
<li>Type-safe - its impossible to insert an element of the wrong type into the Bloom filter.</li>
<li>Type-safe - its impossible to perform a union or intersect operation between incompatible Bloom filters.</li>
- To meet these design goals, the Bloom filter is implemented using template parameters. There are three templates:
+ To meet these goals, the implementation performs as much work as possible
+ at compile-time. For the static variant, most of the operations can be
+ optimized away at compile time using c++0x mode in GCC. In all cases,
+ all core operations are broken down into 2 steps, performed
+ for each hasher:
+ <li>Generate a hash value - this is the index.</li>
+ <li>Perform the core operation at that index.</li>
+ <a name="cbf"></a>
+ <h3>Counting Bloom Filters</h3>
+ The counting Bloom filter implementation is designed to allow compile-time
+ determination of number of bins, bits per bin, insertion/removal/query element type,
+ bit operation genularity, and hash functions. The design goals were:
- <li>Size - number of bits to use for state representation</li>
- <li>Type - type of elements to hash</li>
- <li>HashFunctions - a boost::mpl::vector of hash functions objects</li>
+ <li>Easy to use correctly</li>
+ <li>Easy to customize the storage: bits per bin and number of bins.</li>
+ <li>Easy to use multiple, different hash functions</li>
+ <li>Compact - the amount of space used by the counting Bloom filter is at most
+ a few bytes greater than the number of bins multiplied by the number of bits per bin.</li>
+ <li>Type-safe - its impossible to insert/remove an element of the wrong type into the Bloom filter.</li>
+ <li>Type-safe - its impossible to perform a union or intersect operation between
+ incompatible Bloom filters.</li>
- Currently, the Bloom filter executes the hash functions one at a time. It is not hard to imagine an implementation that dispatches lightweight threads to perform each of the k hash functions.
+ To meet these goals, the static variant allows customization of many details
+ at compile-time. The dynamic variant allows customization of all parameters
+ except number of bins at compile-time. Furthermore, if a bin underflow or
+ overflow would occur, an exception is thrown. Exceptions are used since
+ they interfere the least with the common case path, protecting performance.
+ This protects the user from introducing false negatives into the operation
+ of their Bloom filter.
<h4>Default Hash Functions</h4>
- To make the Bloom filter easy to use, the HashFunctions template parameter has a default value. This way, if the user wishes to use a Bloom filter for a quick, non-critical task, the user has the ability to just declare and use.
- The default hash function set consists of three boost_hash function objects. These are implemented by computing Boost.Functional hash_value and incrementing the result by the seed. The boost_hash function object is made available in the file boost/bloom_filter/hash/default.hpp for users that wish to use more or fewer hash functions to use their Bloom filter without having to create their own hash function object.
+ To make the Bloom filter easier to use, the HashFunctions template parameter
+ has a default value. This way, if the user wishes to use a Bloom filter for
+ a quick, non-critical task, the user has the ability to just declare and use.
- <strong>Be warned</strong> - the Boost default hash function when hashing integral types returns the unsigned version of the integer itself. This makes it such that using the default Bloom filter works against you. In this case, it's best to use a single boost_hash function object with seed set to 0.
+ Be aware: it is not wise to use multiple hash functions of the same type
+ as the elements of your HashFunctions vector. This introduces a problem
+ where the hashes calculated are not independent, therefore reducing
+ the effectiveness of the Bloom filter, e.g., as the number of hash
+ functions increases, so does the false positive rate.
@@ -114,9 +171,8 @@
<li>Bloom filter variations implemented as separate classes:
- <li>Counting Bloom filter (deletion support): <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.96.6789&rep=rep1&type=pdf">
- Summary Cache. Fan et al. ACM SIGCOMM. 1998.
+ <li><del>Counting Bloom filter (deletion support): <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.96.6789&rep=rep1&type=pdf">
+ Summary Cache. Fan et al. ACM SIGCOMM. 1998.</a></del></li>
<li>Scaling Bloom filter (constant false positive rate): <a href="http://www.google.com/url?sa=t&source=web&cd=1&ved=0CBsQFjAA&url=http%3A%2F%2Fgsd.di.uminho.pt%2Fmembers%2Fcbm%2Fps%2Fdbloom.pdf&ei=ZH_2TbboIeeu0AH5o4XtDA&usg=AFQjCNFGN0RS1-bUCxZJwUmD0C8cTnllQg&sig2=__OvO8j0bdnj5gDH0xt4mw">
Scalable Bloom Filter. </a>Almeida et al. Elsevier. 2007.
Boost-Commit list run by bdawes at acm.org, david.abrahams at rcn.com, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk