Subject: Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project
From: Alejandro Cabrera (cpp.cabrera_at_[hidden])
Date: 2011-06-23 15:55:59
> Having bloom_filter in boost:: is a great idea! It is super useful; we use it in
> production code (our own implementation; actually we use Bloomier filters
> with murmur hash perfect hashing from your wiki reference).
> It lets us use a very memory- and lookup-efficient data structure for very
> large datasets (think a 100 GB file of strings on an SSD disk indexed by e.g. a
> 200 MB Bloomier filter in memory).
I've read about Bloomier filters, though those seem a ways off still.
I'm building from the most basic to the more complex variants, in general.
Your statistics sound amazing, to say the least! Working with very large
data isn't something I've had the opportunity to do just yet, though I'm
interested and aware of such scenarios, as my research area is in
operating systems and storage issues.
> Did you consider adding serialization of a bloom_filter to your
I haven't considered it at length, but there must be a way to encode the
type of the Bloom filter uniquely prior to writing out its bits, so that
when deserialization time comes, a user wouldn't accidentally
instantiate a Bloom filter of the wrong type. The idea I have in mind at
the moment is to first serialize the template parameters of the Bloom
filter (Type, Size, HashFunctions). The Type and the HashFunctions
appear to be the trickiest, since their structure can take on as many
possibilities as the type system allows. Writing out the bits should be
the easiest part.
If you have any ideas in this regard, I'd appreciate it if you shared them.
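One rough sketch of the idea above: write a small header identifying the instantiation (an element-type tag plus the bit count) ahead of the raw bits, and verify that header on load before touching the bitset. The `bloom_filter`, `save`, and `load` names below are hypothetical stand-ins, not the proposed library interface, and `typeid(T).name()` is only a placeholder tag since it is not portable across compilers.

```cpp
#include <bitset>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <typeinfo>

// Hypothetical minimal stand-in for the proposed filter: Size is the
// number of bits. The HashFunctions parameter is omitted here; a real
// scheme would need a stable tag for it as well.
template <typename T, std::size_t Size>
struct bloom_filter {
    std::bitset<Size> bits;
};

// Serialize: a header naming the instantiation, then the raw bits.
template <typename T, std::size_t Size>
void save(std::ostream& os, const bloom_filter<T, Size>& f) {
    os << typeid(T).name() << '\n' << Size << '\n'
       << f.bits.to_string() << '\n';
}

// Deserialize: check the header against the requested instantiation
// first, so a filter cannot be reloaded as the wrong type.
template <typename T, std::size_t Size>
void load(std::istream& is, bloom_filter<T, Size>& f) {
    std::string type_tag, bit_str;
    std::size_t size = 0;
    is >> type_tag >> size >> bit_str;
    if (type_tag != typeid(T).name() || size != Size)
        throw std::runtime_error("bloom_filter type mismatch on load");
    f.bits = std::bitset<Size>(bit_str);
}
```

The hard part, as noted, is the tag for Type and HashFunctions; the bit array itself round-trips trivially once the header matches.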
> In general, reconstructing hash-based containers with a series of inserts is
> pretty inefficient.
I can imagine this to be the case, especially if the containers are
storing a very large number of elements.
> The use case I'm talking about: e.g. in your web proxy scenario, the proxy
> service keeps running, downloading and caching URLs and adding them to the bloom
> filter. Then the process needs to be restarted for some reason. All
> documents downloaded and stored on disk will have to be reiterated and their
> URLs reinserted into a newly created bloom_filter, which makes the startup of
> the process slow.
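The asymmetry in the proxy scenario can be made concrete: rebuilding the filter costs one hash evaluation per hash function per cached URL, while restoring it is a single copy of the m-bit array. The hand-rolled `tiny_bloom` below is a toy illustration (two hashes derived from `std::hash` by salting), not the proposed library:

```cpp
#include <bitset>
#include <functional>
#include <string>

// Toy Bloom filter over strings with k = 2 hash functions, each derived
// from std::hash by appending a salt character. Illustration only.
struct tiny_bloom {
    static constexpr std::size_t m = 1024;  // number of bits
    std::bitset<m> bits;

    std::size_t h(const std::string& s, char salt) const {
        return std::hash<std::string>()(s + salt) % m;
    }
    void insert(const std::string& url) {
        bits.set(h(url, 'a'));
        bits.set(h(url, 'b'));
    }
    bool maybe_contains(const std::string& url) const {
        return bits.test(h(url, 'a')) && bits.test(h(url, 'b'));
    }
};

// "Restart" path: instead of re-reading every cached document and
// reinserting its URL (n * k hash evaluations), restore the saved
// bit array directly -- one O(m)-bit copy, no per-element work.
tiny_bloom restore(const std::bitset<tiny_bloom::m>& saved) {
    tiny_bloom f;
    f.bits = saved;
    return f;
}
```

A restored filter answers membership queries identically to the original, since a Bloom filter's entire state is its bit array.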
> btw, I have the same problem with standard containers (except std::vector).
> There is no efficient serialization/deserialization for them, rendering
> them useless for any larger side project (like an unordered_set of 1M strings).
There was a Boost Serialization library in the works. It seems that the
last time it was maintained was 7 years ago. I'm not sure what generic,
type-safe [de]serialization would look like - it sounds like a project
all on its own! I agree that the C++ standard would benefit greatly from
a means to [de]serialize data structures, much like Python has the
pickle module.
Thank you for your feedback, Jakub.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk