Boost logo

Boost :

Subject: Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2011-06-25 13:22:04


Alejandro Cabrera wrote:
>> Sometimes it would be useful to access the underlying data. For
>> example, say I have a spell check dictionary or a URL block-list; I
>> might want to compile that list into a bloom filter and ship that as a
>> binary file. For that I'd want to be able to use it as an adaptor on
>> top of an underlying container that I supply:
>>
>> // compile:
>> std::vector<uint32_t> data(100000);
>> bloom_filter b(data); // acts as a container adaptor
>> b.insert(...);
>> file f;
>> f.write(data.begin(),data.end());
>>
>> // use:
>> memory_mapped_file f("dictionary.bloom");
>> bloom_filter b(f);
>> if (b.contains(url)) ....
> I'm not sure how this would work. Could you elaborate further in this
> regard?
>
> Currently, as I understand, there's two components of interest. The
> first is to be able to access the underlying Bloom filter data for
> serialization/deserialization purposes. This should be easy enough to do
> given a function that exposes the Bloom filter data. The other component
> that I'm picking up on is to be able to construct a Bloom filter using
> various sources. Range construction will be added that semantically
> behaves as an insertion for each element in the range. However, I'm not
> sure how to support construction from a memory_mapped_file.

OK, let me expand on the example I started above. Say I have a long
list of "banned" URLs and an interactive program that must check URLs
entered by the user against that list. I pre-compile the bloom filter
and save its raw binary content to a file. Then in the application, I
map that data into memory. The point about memory-mapping is that the
data is not actually read from the file until it is used, so in this
case the bloom filter's ctor can be very fast; when the user types in a
URL, only those pages of the file that contain the bits corresponding
to the URL's hashes are read. This is both faster and more
memory-efficient than loading all of the data.

In order for this to work, it's only necessary that I can make a bloom
filter that is an adaptor over a range; the range that I provide will
be a pair of pointers that refer to the memory region where the file is
mapped. I.e. something like this (pseudo-code):

int fd = open("banned_urls.dat", O_RDONLY);
const char* begin = mmap(0, size, PROT_READ, fd, 0);
const char* end = begin + size;
bloom_filter_adaptor<const char*> banned_urls(begin,end);

Currently you're using std::bitset for the underlying data. Since that
doesn't (AFAIK) have a way to adapt an underlying implementation data
type, I think you'll need to replace it with a wrapper that does do
that. Then you can have something like this:

template <typename ITER_T>
class bloom_filter_adaptor {
   typedef bitset_adaptor<ITER_T> impl_t;
   impl_t impl;
public:
   bloom_filter_adaptor(ITER_T begin, ITER_T end):
     impl(begin,end)
   {}
   ...
};

(An alternative to taking an iterator-pair is to take a container and
store a reference to it.)

Then the non-adaptor version, for when that flexibility is not needed:

template <size_t SIZE>
struct bloom_filter_base {
   typedef std::vector<uint32_t> data_t;
   data_t data;
   bloom_filter_base():
     data(0,SIZE)
   {}
};

template <size_t SIZE>
class bloom_filter:
   bloom_filter_base<SIZE>,
   bloom_filter_adaptor<bloom_filter_base::data_t>
{
public:
   bloom_filter():
     bloom_filter_adaptor(data.begin(),data.end())
   {}
};

...or something like that.

Regards, Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk