Boost logo

Boost :

Subject: Re: [boost] Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2009-06-09 00:52:33


On Tue, Jun 9, 2009 at 4:36 AM, Joaquin M Lopez Munoz<joaquin_at_[hidden]> wrote:
> Dean Michael Berris <mikhailberis <at> gmail.com> writes:
>
>> The implementation is now available in Boost's Sandbox. You can check
>> out the implementation along with the documentation via:
>>
>> [...]
>>
>> I'd also like to request to add this to the formal review queue (if
>> this qualifies for a fast track review, that would also be an option
>> I'm open to).
>
> With all due respect, I think your submission is in too early
> a state to already request a formal review. I'd say you are
> in "refinement" stage as described in:
>
> http://www.boost.org/development/submissions.html
>

Oh, okay. Thanks for pointing this out.

> So you can still get a lot of feedback till the design
> reaches a mature state. IMHO there're plenty of
> opportunities for improvement, here are a handful of
> observations in no particular order:
>
> 1. Your material lacks proper tests and examples, and
> the docs are sketchy (a formal reference would be
> needed at least).

Okay. I can work on adding more tests (or at least having more
"formal" unit tests).

> 2. bloom_filter should have an allocator template
> parameter, just like any other container around.

Okay, I'll just forward this anyway to the underlying data structure
(bitset) so that should be easy to add.

> 3. Have you considered the possibility to let
> the user specify M and K at constuction time rather
> than compile tiem and the pros and cons of both
> approaches?

Yes, I have but I've put a premium at type safety and putting M and K
as part of the type instead of being a runtime value. The reasons are:

1. Specifying a static M allows the use of the standard bitset instead
of Boost's dynamic_bitset.
2. A static K allows for enforcing the number of hash functions at
compile time instead of at runtime.
3. Since M and K are not variable and are part of the type, it makes
little sense for me to have either of them set at construction time.

One thing I might be amenable to is a static K and a runtime M, but
that makes the hash functions unable to determine statically the
bounds of the bitset (M). I can change the concept for the hash
function, but the simplicity of size_t(Input) just makes it easier for
users I think.

> 4. It'd be nice to have a way to specify the
> desired false positive probability instead of
> M and K.

That'd be nice, but then the hash functions would have to be provided
for by the implementation too to ensure that the probability is held.

> 5. It'd be nice if entry-level users can get
> a default set of hash functions so that they
> need not specifiy them --maybe a parameterized
> hash formula h(x,N) so that you can just use
> h(x,0), h(x,1),... h(x,N-1).

I agree. I can use Boost.CRC's functions and implement a simple
mod-hash against M. Let me see what I can come up with. BTW, this may
require that the actual bloom_filter<> template be renamed to
something like static_bloom_filter<> and have a dynamic_bloom_filter<>
version as well that supports runtime-setting of M like something I
describe above. The bloom_filter<> will derive from
static_bloom_filter<> and provide a set of basic hash functions so
that users don't have to worry about providing their own hash
functions.

Or, I can support a default-constructible bloom_filter<> that provides
these functions too. This sounds like something I'd be able to
implement quickly.

> 6. You're not providing a free function
> swap(bloom_filter&,bloom_filter&).

Not yet, you're correct. Let me address this soon.

> 7. bloom_filters are not currently testable for
> equality.

You're correct. I'll add this soon.

> 8. bloom_filters are not currently serializable.

True. I might need to have a file 'boost/bloom_filter/serialize.hpp'
to add serializability non-intrusively.

> 9. Union and intersection of bloom_filters are not
> provided (according to the Wikipedia entry these
> ops are trivial to implement).

True.

> 10. For completeness, seems like constructing a
> bloom_filter from a bitset_type would be nice to
> have.

I agree, this is something that can be added soon as I allow for
having a default set of hash functions for bloom filters.

> 11. You'd have to decide whether you want to make
> this look more like a std::map<T,bool> or not.
> If the former, some boilerplate typedefs
> like value_type and memer functions like size()
> could be provided.
>

I'm still struggling with this mainly because I'd think a bloom filter
would better model an std::set's behavior, but without access to the
"stored" elements (mainly because the elements aren't really stored).

Thanks for the great insights and feedback, I'll try implementing
these things sooner than later.

Have a great day!

-- 
Dean Michael Berris | Software Engineer, Friendster, Inc.
blog.cplusplus-soup.com | twitter.com/mikhailberis |
linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis |
deanberris.com

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk