
Subject: Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2011-06-21 12:55:12


Hi Alejandro,

Alejandro Cabrera wrote:
> I'm writing to request feedback, suggestions, and advice from the Boost
> community. I've been working on a Bloom filter data structure for about
> a month and a half now. At this point, the following is complete and
> available in the Boost Sandbox:
> - working, basic prototype (insert & query operations, union & intersect)
> - a test suite covering the public interface
> - five examples
> - documentation covering many aspects of the project (tested on Chrome
> 12 and 13, Firefox 4 and 6)

At first glance, this looks very good. Well done.

Bloom filters are essentially quite a simple concept, and I suggest
that you stick to that, i.e. resist any temptation to add lots of
extra features. Having said that, I think you should look at e.g.
std::set<> and std::unordered_set<> and see what you can do to make
your class closer to them. I can think of:

- empty()
- clear() (Ah, you do have this, but not documented?)
- a constructor that takes a range, e.g.
   int a[] = { 1,2,3,4,5 };
   bloom_filter<int> b(a,a+5);
   std::list<int> l;
   bloom_filter<int> c(l.begin(),l.end());
- insert(begin,end) similarly
- swap()
- You have a size() which is (confusingly?) different from those
containers' size(); it's more like capacity(), I guess. (Similarly, I
think I would rename count() to something like bit_count().) A rough
sketch of these suggestions follows.
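
Something like this, purely hypothetical declarations with names that
are my guesses, not what your sandbox code actually has:

template <typename T>
class bloom_filter {
public:
    bloom_filter();
    template <typename InputIterator>
    bloom_filter(InputIterator first, InputIterator last); // insert a range

    void insert(const T& value);
    template <typename InputIterator>
    void insert(InputIterator first, InputIterator last);

    bool contains(const T& value) const;
    bool empty() const;            // nothing inserted yet
    void clear();                  // reset every bit
    void swap(bloom_filter& other);

    std::size_t capacity() const;  // what your size() is now: bits in the filter
    std::size_t bit_count() const; // what your count() is now: bits set
};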

There is a trade-off between the size of the filter, the number of
values stored, the number of hash functions and the false positive
rate. You might consider adding some sort of compile-time
metafunction to help the user determine the right size and number of
hash functions, based on their expected number of values stored and
acceptable false positive rate. Similarly, you could add a static
expected_false_positive_rate() method that takes the number of stored
values as a parameter.
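
The arithmetic those helpers would wrap is standard; here it is as a
run-time sketch (the helper names are mine; a compile-time version
would be the same formulas in template form):

#include <cmath>
#include <cstddef>

// n = expected number of stored values, m = bits in the filter,
// k = number of hash functions, p = target false positive rate.

// Fewest bits meeting the target rate: m = -n ln(p) / (ln 2)^2.
inline std::size_t suggested_bit_count(std::size_t n, double p) {
    double ln2 = std::log(2.0);
    return static_cast<std::size_t>(
        std::ceil(-double(n) * std::log(p) / (ln2 * ln2)));
}

// Optimal number of hashes for that geometry: k = (m / n) ln 2.
inline std::size_t suggested_hash_count(std::size_t m, std::size_t n) {
    return static_cast<std::size_t>(
        std::round(double(m) / double(n) * std::log(2.0)));
}

// Expected rate after n insertions: (1 - e^(-k n / m))^k.
inline double expected_false_positive_rate(std::size_t n, std::size_t m,
                                           std::size_t k) {
    return std::pow(1.0 - std::exp(-double(k) * double(n) / double(m)),
                    double(k));
}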

Sometimes it would be better for size to be a run-time parameter.

Sometimes it would be useful to access the underlying data. For
example, say I have a spell check dictionary or a URL block-list; I
might want to compile that list into a bloom filter and ship that as a
binary file. For that I'd want to be able to use it as an adaptor on
top of an underlying container that I supply:

// compile:
std::vector<uint32_t> data(100000);
bloom_filter b(data); // acts as a container adaptor on top of 'data'
b.insert(...);
file f;               // some file abstraction
f.write(data.begin(), data.end());

// use:
memory_mapped_file f("dictionary.bloom");
bloom_filter b(f);    // the same adaptor, now over the mapped file
if (b.contains(url)) ....

I think it would be sufficient for false_positive_rate() to return a
float. No-one cares about its 19th digit.

For some reason I have a mental block about how intersection works.
Union is OK; I can see that

set a;
set b;
bloom_filter(a) union bloom_filter(b) == bloom_filter( a union b );

but it's not clear to me if

bloom_filter(a) intersect bloom_filter(b)
     == bloom_filter( a intersect b );

Can you convince me?
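
To make the question concrete, this is the sort of toy experiment that
would convince me one way or the other: build filters for a, b, and
a-intersect-b with the same size and hash functions, and see whether
AND-ing the first two reproduces the third. (A stand-in filter with
made-up hashes, not your class.)

#include <bitset>
#include <functional>
#include <iostream>
#include <set>
#include <string>

// Stand-in filter: 64 bits, two crude hashes.
struct toy_filter {
    std::bitset<64> bits;
    void insert(const std::string& s) {
        bits.set(std::hash<std::string>()(s) % 64);
        bits.set(std::hash<std::string>()(s + "#") % 64); // cheap second hash
    }
};

int main() {
    std::set<std::string> a = {"cat", "dog", "emu"};
    std::set<std::string> b = {"dog", "emu", "fox"};
    toy_filter fa, fb, f_union, f_inter;
    for (const auto& s : a) { fa.insert(s); f_union.insert(s); }
    for (const auto& s : b) { fb.insert(s); f_union.insert(s); }
    for (const char* s : {"dog", "emu"}) f_inter.insert(s); // a intersect b
    // OR of the filters vs the filter of the union, likewise for AND:
    std::cout << "union:     " << ((fa.bits | fb.bits) == f_union.bits) << '\n'
              << "intersect: " << ((fa.bits & fb.bits) == f_inter.bits) << '\n';
}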

During the XInt review there was some discussion about whether
is_prime() should be called is_probably_prime(). The same argument
could be made here about whether your contains() should be
probably_contains(). If there were some six-letter word meaning
"probably contains" then I would support using it, but
"probably_contains" is a lot of typing.

Regarding hash functions: I believe I have read that some care is
needed when you use essentially the same hash function but with
different seeds, i.e. are the results really independent? It would
be more obviously correct to have a set of hash functions that use
entirely different algorithms. Unfortunately, this is hard.
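
For what it's worth, Kirsch and Mitzenmacher's "Less Hashing, Same
Performance" paper addresses exactly this: derive all k indices from
just two base hashes, g_i(x) = h1(x) + i*h2(x) mod m, which they show
preserves the asymptotic false positive rate. A sketch (the salted
second hash here is a crude stand-in for a genuinely different
algorithm):

#include <cstddef>
#include <functional>
#include <string>

// i-th of k bit indices from two base hashes (double hashing).
inline std::size_t nth_bit_index(const std::string& key,
                                 std::size_t i, std::size_t m) {
    std::size_t h1 = std::hash<std::string>()(key);
    std::size_t h2 = std::hash<std::string>()(key + '\x01'); // stand-in
    return (h1 + i * h2) % m;
}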

Performance: it would be great to see some benchmarks for different
types against std::set and std::unordered_set. In particular, it would
be interesting to look at string keys of different lengths: both the bloom
filter and the unordered_set always have to look at every byte of the
string to compute the hash, while a std::set need only look at the
string's unique prefix. You mention things like web caches, and so I
guess getting some real-world benchmark data would be simple enough.
One hashing option for strings is to hash substrings.
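
A minimal harness along these lines would do for the set comparison
(made-up corpus and sizes; your bloom_filter would slot into the same
template with contains() in place of count()):

#include <chrono>
#include <iostream>
#include <random>
#include <set>
#include <string>
#include <unordered_set>
#include <vector>

std::string random_key(std::mt19937& rng, std::size_t len) {
    std::uniform_int_distribution<int> d('a', 'z');
    std::string s(len, ' ');
    for (char& c : s) c = static_cast<char>(d(rng));
    return s;
}

// Time negative-heavy lookups; works for any container with count().
template <typename Container>
double time_lookups(const Container& c,
                    const std::vector<std::string>& probes) {
    auto t0 = std::chrono::steady_clock::now();
    std::size_t hits = 0;
    for (const auto& p : probes) hits += c.count(p);
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "(hits " << hits << ") "; // keep the loop from being elided
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::mt19937 rng(42);
    // Note: random keys share almost no prefix; real URLs would
    // stress std::set's comparisons much harder.
    for (std::size_t len : {8, 64, 512}) {
        std::set<std::string> s;
        std::unordered_set<std::string> u;
        std::vector<std::string> probes;
        for (int i = 0; i < 100000; ++i) {
            std::string k = random_key(rng, len);
            s.insert(k);
            u.insert(k);
        }
        for (int i = 0; i < 100000; ++i)
            probes.push_back(random_key(rng, len));
        std::cout << "len " << len << ": set " << time_lookups(s, probes)
                  << "s, unordered_set " << time_lookups(u, probes) << "s\n";
    }
}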

Anyway, good work so far.

Regards, Phil.

