Subject: Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project
From: jakub szymanski (qba.szymanski_at_[hidden])
Date: 2011-06-21 10:06:56


Having bloom_filter in boost:: is a great idea! It is super useful; we use
one in production code (our own implementation; actually we use Bloomier
filters with murmur hash perfect hashing, from your wiki reference).
It lets us use a very memory- and lookup-efficient data structure for very
large datasets (think of a 100 GB file of strings on an SSD, indexed by e.g.
a 200 MB Bloomier filter in memory).

Have you considered adding serialization of a bloom_filter to your
implementation?
In general, reconstructing hash-based containers with a series of inserts is
pretty inefficient.
The use case I'm talking about: in your web proxy scenario, the proxy
service keeps running, downloading and caching URLs and adding them to the
bloom filter. Then the process needs to be restarted for some reason. All
documents downloaded and stored on disk will have to be iterated over again
and their URLs reinserted into the newly created bloom_filter, which makes
the startup of the process slow.
btw, I have the same problem with the standard containers (except
std::vector). There is no efficient serialization / deserialization for
them, which renders them useless for any larger side project (like an
unordered_set of 1M strings).
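
To make the restart scenario concrete, here is a minimal sketch of what
I mean by serialization. Everything in it is an assumption for
illustration: toy_bloom_filter, save() and load() are hypothetical names,
not the interface of the proposed library, and FNV-1a with double hashing
stands in for whatever hash functions the real filter would use (we use
murmur hash ourselves). The point is only that the bit array can be written
and read back in one pass instead of re-inserting every URL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical toy filter, NOT the proposed Boost.Bloom Filter interface.
class toy_bloom_filter {
public:
    toy_bloom_filter(std::size_t bit_count, std::size_t hash_count)
        : bits_((bit_count + 7) / 8, 0),
          bit_count_(bit_count),
          hash_count_(hash_count) {
        assert(bit_count > 0 && hash_count > 0);
    }

    void insert(const std::string& key) {
        for (std::size_t i = 0; i < hash_count_; ++i) {
            const std::size_t b = index(key, i);
            bits_[b / 8] |= static_cast<unsigned char>(1u << (b % 8));
        }
    }

    // May return true for a key that was never inserted (false positive),
    // but never returns false for a key that was inserted.
    bool probably_contains(const std::string& key) const {
        for (std::size_t i = 0; i < hash_count_; ++i) {
            const std::size_t b = index(key, i);
            if ((bits_[b / 8] & (1u << (b % 8))) == 0) return false;
        }
        return true;
    }

    // Writes a two-word header followed by the raw bit array.
    // Assumption: the file is read back on a machine with the same byte order.
    void save(std::ostream& out) const {
        const std::uint64_t header[2] = {bit_count_, hash_count_};
        out.write(reinterpret_cast<const char*>(header), sizeof header);
        out.write(reinterpret_cast<const char*>(bits_.data()),
                  static_cast<std::streamsize>(bits_.size()));
    }

    // Rebuilds the filter from the stream without re-inserting any key.
    static toy_bloom_filter load(std::istream& in) {
        std::uint64_t header[2] = {0, 0};
        in.read(reinterpret_cast<char*>(header), sizeof header);
        if (!in || header[0] == 0 || header[1] == 0)
            throw std::runtime_error("bad bloom filter header");
        toy_bloom_filter f(static_cast<std::size_t>(header[0]),
                           static_cast<std::size_t>(header[1]));
        in.read(reinterpret_cast<char*>(f.bits_.data()),
                static_cast<std::streamsize>(f.bits_.size()));
        if (!in) throw std::runtime_error("truncated bloom filter data");
        return f;
    }

private:
    // FNV-1a: picked because it yields the same values in every run, which
    // a persisted filter needs. The second seed is an arbitrary constant.
    static std::uint64_t fnv1a(const std::string& s, std::uint64_t seed) {
        std::uint64_t h = seed;
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ULL;
        }
        return h;
    }

    // Double hashing: the i-th bit position is (h1 + i * h2) mod bit_count.
    std::size_t index(const std::string& key, std::size_t i) const {
        const std::uint64_t h1 = fnv1a(key, 14695981039346656037ULL);
        const std::uint64_t h2 = fnv1a(key, 0x9e3779b97f4a7c15ULL) | 1;
        return static_cast<std::size_t>((h1 + i * h2) % bit_count_);
    }

    std::vector<unsigned char> bits_;
    std::size_t bit_count_;
    std::size_t hash_count_;
};
```

In the proxy case the service would call save() on a std::ofstream opened
in binary mode before shutting down, and load() on the matching
std::ifstream at startup. The cost of a restart is then one sequential read
of the bit array, independent of how many URLs were inserted.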

--
View this message in context: http://boost.2283326.n4.nabble.com/GSoC-Request-for-Feedback-on-Boost-Bloom-Filter-Project-tp3614026p3614200.html
Sent from the Boost - Dev mailing list archive at Nabble.com.
