Subject: Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project
From: Alejandro Cabrera (cpp.cabrera_at_[hidden])
Date: 2011-06-22 13:56:22
Hello Phil,
> Hi Alejandro,
>
> Alejandro Cabrera wrote:
>> I'm writing to request feedback, suggestions, and advice from the Boost
>> community. I've been working on a Bloom filter data structure for about
>> a month and a half now. At this point, the following is complete and
>> available in the Boost Sandbox:
>> - working, basic prototype (insert & query operations, union & intersect)
>> - a test suite covering the public interface
>> - five examples
>> - documentation covering many aspects of the project (tested on Chrome
>> 12 and 13, Firefox 4 and 6)
> At first glance, this looks very good. Well done.
>
> Bloom filters are essentially a quite simple concept, and I suggest
> that you should stick to that i.e. resist any temptation to add lots of
> extra features. Having said that, I think you should look at e.g.
> std::set<> and std::unordered_set and see what you can do to make your
> class closer to them. I can think of:
>
> - empty()
> - clear() (Ah, you do have this, but not documented?)
I accidentally skipped documenting this. Thank you for pointing this out.
> - a constructor that takes a range, e.g.
> int a[] = { 1,2,3,4,5 };
> bloom_filter<int> b(a,a+5);
> std::list<int> l;
> bloom_filter<int> c(l.begin(),l.end());
> - insert(begin,end) similarly
I hadn't considered various ways to construct the Bloom filter. I
assumed users would always want to start with an empty Bloom filter, but
this would make using the Bloom filter more flexible.
> - swap()
I overlooked swap. I'll add this.
> - You have a size() which is (confusingly?) different from those
> containers' size(); it's more like capacity() I guess. (Similarly I
> think I would rename count() to maybe bit_count().)
I agree that using size() as I do is confusing, versus what size() has
come to mean for standard containers. This would better be named
something like bit_count(). I'll reconsider the name.
> There is a trade-off between the size of the filter, the number of
> values stored, the number of hash functions and the false positive
> rate. You might consider adding some sort of compile-time
> metafunctions to help the user determine the right size and number of
> hash functions, based on their expected number of values stored and
> acceptable false positive rate. Similarly, you could add a static
> expected_false_positive_rate() method that takes the number of stored
> values as a parameter.
Given a desired false positive rate, an expected number of elements and
an upper-limit on the storage requirement, it is possible to determine
both a good Size value and a good number of hash functions. I see now
how meta-functions can help alleviate the burden on the user for picking
good values.
This might take a bit of work to provide as a feature. It would involve
ratios to represent the false positive rate as a template parameter to
the meta-function. I'll put this feature off 'til version 1.1 or 1.2.
Until that time, users will have to determine the best parameters for
their application using the documentation guidelines. On that note, I
think I should add a section to the tutorial detailing how to choose
good parameters.
> Sometimes it would be better for size to be a run-time parameter.
I have a dynamic Bloom filter in the queue for variants. It's definitely
required if I were to develop a scalable Bloom filter that guarantees a
certain false positive rate, as detailed in a recent research paper
titled "Scalable Bloom Filters".
> Sometimes it would be useful to access the underlying data. For
> example, say I have a spell check dictionary or a URL block-list; I
> might want to compile that list into a bloom filter and ship that as a
> binary file. For that I'd want to be able to use it as an adaptor on
> top of an underlying container that I supply:
>
> // compile:
> std::vector<uint32_t> data(100000);
> bloom_filter b(data); // acts as a container adaptor
> b.insert(...);
> file f;
> f.write(data.begin(),data.end());
>
> // use:
> memory_mapped_file f("dictionary.bloom");
> bloom_filter b(f);
> if (b.contains(url)) ....
I'm not sure how this would work. Could you elaborate further in this
regard?
Currently, as I understand it, there are two components of interest. The
first is to be able to access the underlying Bloom filter data for
serialization/deserialization purposes. This should be easy enough to do
given a function that exposes the Bloom filter data. The other component
that I'm picking up on is to be able to construct a Bloom filter using
various sources. Range construction will be added that semantically
behaves as an insertion for each element in the range. However, I'm not
sure how to support construction from a memory_mapped_file.
> I think it would be sufficient for false_positive_rate() to return a
> float. No-one cares about its 19th digit.
I preferred greater precision to anticipate user needs. Unless
false_positive_rate() were called frequently, I wouldn't expect that the
added precision would incur a significant performance penalty.
> For some reason I have a mental block about how intersection works.
> Union is OK; I can see that
>
> set a;
> set b;
> bloom_filter(a) union bloom_filter(b) == bloom_filter( a union b );
>
> but it's not clear to me if
>
> bloom_filter(a) intersect bloom_filter(b) == bloom_filter( a intersect
> b );
>
> Can you convince me?
Intersection leaves on only the bits that are set in both Bloom
filters. Since intersection is only allowed between Bloom filters that
use the same hash function set and the same storage size, any element
inserted into both filters sets the same bits in each, and all of those
bits survive the intersection. The result may carry extra set bits (and
hence extra false positives) beyond bloom_filter(a intersect b), but it
never loses an element of the true intersection. Assume that both Bloom
filters are instantiated with a hash function that returns the integer
value of the element hashed. Then the following occurs:
bloom_filter<int, 4> a;
bloom_filter<int, 4> b;
bloom_filter<int, 4> c;
a.insert(1);
a.insert(2);
// 0110
b.insert(2);
b.insert(3);
// 0011
c = a & b;
// 0010
At this point, only c.contains(2) will return true.
> During the XInt review there was some discussion about whether
> is_prime() should be called is_probably_prime(). The same argument
> could be made here about whether your contains() should be
> probably_contains(). If there were some six-letter word meaning
> "probably contains" then I would support using it, but
> "probably_contains" is a lot of typing.
The wording for contains() is another tricky issue. probably_contains()
has the benefit that it points out the probabilistic nature of the data
structure. This is very helpful to warn users. I don't think the
addition of 9 more characters to type (probably_) should be a problem,
since it isn't the type of function that can nest on itself
(x.contains(t).contains...). I will tentatively rename the function
probably_contains(), and if further comments indicate that this adds
some level of confusion, I'll reconsider the name again.
> Regarding hash functions: I believe I have read that some care is
> needed when you use essentially the same hash function but with
> different seeds i.e. are the results really independent? It would be
> be more obviously correct to have a set of hash functions that use
> entirely different algorithms. Unfortunately this is hard.
I've left the hashing component open to extension exactly because of
this reasoning. The Bloom filter data structure does rely greatly on the
use of good hash functions. However, the definition of "good" depends on
the context. It might be that a CRC32 is perfect for some applications,
but a MurmurHash3 is better for others, and some applications might even
require a cryptographically secure hash. Since the intent of the project
is not to develop great hash functions, I'm inclined to keep
boost::hash_value as the default, though it might be best to default
to an mpl::vector containing a single hash function instead of two,
since, as you stated, the results aren't independent.
> Performance: it would be great to see some benchmarks for different
> types against std::set and std::unordered_set. In particular it would
> interesting to look at string keys of different lengths: both the bloom
> filter and the unordered_set always have to look at every byte of the
> string to compute the hash, while a std::set need only look at the
> string's unique prefix. You mention things like web caches, and so I
> guess getting some real-world benchmark data would be simple enough.
> One hashing option for strings is to hash substrings.
>
I suspect that a benchmark on at least strings would be useful to
demonstrate how the bloom_filter compares against std::set and
std::unordered_set, assuming that most users will be hashing data (file
names, URLs, etc.).
I'll add benchmarks to my queue of things to do, but I want to make sure
the interface is close to stable before I commit to writing benchmarks.
> Anyway, good work so far.
>
>
> Regards, Phil.
>
Thank you for taking the time to comment so thoroughly on the Bloom
filter project!
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk