Boost :

Date view	Thread view	Subject view	Author view

Subject: Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2011-06-25 13:22:04

Next message: Phil Endecott: "Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project"
Previous message: Joshua Boyce: "Re: [boost] SVN Ignore for Bin dirs"
In reply to: Alejandro Cabrera: "Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project"
Next in thread: Alejandro Cabrera: "Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project"

Alejandro Cabrera wrote:
>> Sometimes it would be useful to access the underlying data. For
>> example, say I have a spell check dictionary or a URL block-list; I
>> might want to compile that list into a bloom filter and ship that as a
>> binary file. For that I'd want to be able to use it as an adaptor on
>> top of an underlying container that I supply:
>>
>> // compile:
>> std::vector<uint32_t> data(100000);
>> bloom_filter b(data); // acts as a container adaptor
>> b.insert(...);
>> file f;
>> f.write(data.begin(),data.end());
>>
>> // use:
>> memory_mapped_file f("dictionary.bloom");
>> bloom_filter b(f);
>> if (b.contains(url)) ....
> I'm not sure how this would work. Could you elaborate further in this
> regard?
>
> Currently, as I understand, there's two components of interest. The
> first is to be able to access the underlying Bloom filter data for
> serialization/deserialization purposes. This should be easy enough to do
> given a function that exposes the Bloom filter data. The other component
> that I'm picking up on is to be able to construct a Bloom filter using
> various sources. Range construction will be added that semantically
> behaves as an insertion for each element in the range. However, I'm not
> sure how to support construction from a memory_mapped_file.

OK, let me expand on the example I started above. Say I have a long
list of "banned" URLs and an interactive program that must check URLs
entered by the user against that list. I pre-compile the bloom filter
and save its raw binary content to a file. Then in the application, I
map that data into memory. The point about memory-mapping is that the
data is not actually read from the file until it is used, so in this
case the bloom filter's ctor can be very fast; when the user types in a
URL, only those pages of the file that contain the bits corresponding
to the URL's hashes are read. This is both faster and more
memory-efficient than loading all of the data.

In order for this to work, it's only necessary that I can make a bloom
filter that is an adaptor over a range; the range that I provide will
be a pair of pointers that refer to the memory region where the file is
mapped. I.e. something like this (pseudo-code):

int fd = open("banned_urls.dat", O_RDONLY);
const char* begin = mmap(0, size, PROT_READ, fd, 0);
const char* end = begin + size;
bloom_filter_adaptor<const char*> banned_urls(begin,end);

Currently you're using std::bitset for the underlying data. Since that
doesn't (AFAIK) have a way to adapt an underlying implementation data
type, I think you'll need to replace it with a wrapper that does do
that. Then you can have something like this:

template <typename ITER_T>
class bloom_filter_adaptor {
   typedef bitset_adaptor<ITER_T> impl_t;
   impl_t impl;
public:
   bloom_filter_adaptor(ITER_T begin, ITER_T end):
     impl(begin,end)
   {}
   ...
};

(An alternative to taking an iterator-pair is to take a container and
store a reference to it.)

Then the non-adaptor version, for when that flexibility is not needed:

template <size_t SIZE>
struct bloom_filter_base {
   typedef std::vector<uint32_t> data_t;
   data_t data;
   bloom_filter_base():
     data(0,SIZE)
   {}
};

template <size_t SIZE>
class bloom_filter:
   bloom_filter_base<SIZE>,
   bloom_filter_adaptor<bloom_filter_base::data_t>
{
public:
   bloom_filter():
     bloom_filter_adaptor(data.begin(),data.end())
   {}
};

...or something like that.

Regards, Phil.

Next message: Phil Endecott: "Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project"
Previous message: Joshua Boyce: "Re: [boost] SVN Ignore for Bin dirs"
In reply to: Alejandro Cabrera: "Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project"
Next in thread: Alejandro Cabrera: "Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project"

Date view	Thread view	Subject view	Author view

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk