Boost logo

Boost :

Subject: Re: [boost] Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2009-06-09 04:40:37


Just an update -- as of Sandbox revision 53772, I've addressed some of
the issues below:

On Tue, Jun 9, 2009 at 12:52 PM, Dean Michael
Berris<mikhailberis_at_[hidden]> wrote:
> On Tue, Jun 9, 2009 at 4:36 AM, Joaquin M Lopez Munoz<joaquin_at_[hidden]> wrote:
>>
>> So you can still get a lot of feedback till the design
>> reaches a mature state. IMHO there're plenty of
>> opportunities for improvement, here are a handful of
>> observations in no particular order:
>>
>> 1. Your material lacks proper tests and examples, and
>> the docs are sketchy (a formal reference would be
>> needed at least).
>
> Okay. I can work on adding more tests (or at least having more
> "formal" unit tests).
>

I've added a couple more tests, hardly "formal" by definition, but is
an improvement over the single test file that's implemented.

>> 2. bloom_filter should have an allocator template
>> parameter, just like any other container around.
>
> Okay, I'll just forward this anyway to the underlying data structure
> (bitset) so that should be easy to add.
>

And apparently I mis-typed -- std::bitset<> doesn't have an allocator
parameter. Because of this, I've chosen to not support for the
meantime a custom allocator.

>> 3. Have you considered the possibility to let
>> the user specify M and K at constuction time rather
>> than compile tiem and the pros and cons of both
>> approaches?
>
> Yes, I have but I've put a premium at type safety and putting M and K
> as part of the type instead of being a runtime value. The reasons are:
>
> 1. Specifying a static M allows the use of the standard bitset instead
> of Boost's dynamic_bitset.
> 2. A static K allows for enforcing the number of hash functions at
> compile time instead of at runtime.
> 3. Since M and K are not variable and are part of the type, it makes
> little sense for me to have either of them set at construction time.
>
> One thing I might be amenable to is a static K and a runtime M, but
> that makes the hash functions unable to determine statically the
> bounds of the bitset (M). I can change the concept for the hash
> function, but the simplicity of size_t(Input) just makes it easier for
> users I think.
>

I thought about it a little more and I don't think I like the idea of
a separate static and dynamic bloom_filter. If there are more reasons
to allow for dynamically growing bloom filters, then I might support
this.

>> 4. It'd be nice to have a way to specify the
>> desired false positive probability instead of
>> M and K.
>
> That'd be nice, but then the hash functions would have to be provided
> for by the implementation too to ensure that the probability is held.
>

I will not be supporting this anytime soon. ;-)

>> 5. It'd be nice if entry-level users can get
>> a default set of hash functions so that they
>> need not specifiy them --maybe a parameterized
>> hash formula h(x,N) so that you can just use
>> h(x,0), h(x,1),... h(x,N-1).
>
> I agree. I can use Boost.CRC's functions and implement a simple
> mod-hash against M. Let me see what I can come up with. BTW, this may
> require that the actual bloom_filter<> template be renamed to
> something like static_bloom_filter<> and have a dynamic_bloom_filter<>
> version as well that supports runtime-setting of M like something I
> describe above. The bloom_filter<> will derive from
> static_bloom_filter<> and provide a set of basic hash functions so
> that users don't have to worry about providing their own hash
> functions.
>

This is already done, put it in
boost/bloom_filter/detail/default_hash.hpp -- with a naive offset and
mod-hash functions. Yes, this should be improved to use Boost.Hash or
other implementations easily. ;-)

> Or, I can support a default-constructible bloom_filter<> that provides
> these functions too. This sounds like something I'd be able to
> implement quickly.
>

Which is already the case. A default constructed bloom filter uses an
unsophisticated hash function internally.

>> 6. You're not providing a free function
>> swap(bloom_filter&,bloom_filter&).
>
> Not yet, you're correct. Let me address this soon.
>

Done.

>> 7. bloom_filters are not currently testable for
>> equality.
>
> You're correct. I'll add this soon.
>

Done, but I've run into the issue of not being able to compare
boost::function objects of the same type -- for some reason this has
been disabled in Boost Trunk.

>> 8. bloom_filters are not currently serializable.
>
> True. I might need to have a file 'boost/bloom_filter/serialize.hpp'
> to add serializability non-intrusively.
>

This should come later... I still need to look into whether
serialization should only be done for the bitset state rather than the
whole type. I say this because up to now I don't think there's a way
of serializing boost::function objects and deserializing them
effectively either.

>> 9. Union and intersection of bloom_filters are not
>> provided (according to the Wikipedia entry these
>> ops are trivial to implement).
>
> True.
>

And I need to implement these later (if there's an actual need for this).

>> 10. For completeness, seems like constructing a
>> bloom_filter from a bitset_type would be nice to
>> have.
>
> I agree, this is something that can be added soon as I allow for
> having a default set of hash functions for bloom filters.
>

And this is done.

HTH

-- 
Dean Michael Berris | Software Engineer, Friendster, Inc.
blog.cplusplus-soup.com | twitter.com/mikhailberis |
linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis |
deanberris.com

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk