Boost logo

Boost :

Subject: Re: [boost] Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2009-06-10 05:22:53


On Wed, Jun 10, 2009 at 1:51 AM, Dean Michael
Berris<mikhailberis_at_[hidden]> wrote:
> On Tue, Jun 9, 2009 at 11:58 PM, <joaquin_at_[hidden]> wrote:
>>
>> Umm, fair point :) this raises two points:
>>
>> 2.1. Allocators are still needed if you implement a variant where M (and
>> maybe K)
>> are passed at construction time.
>
> Indeed. I'm leaning towards making this the default behavior using
> Boost.Dynamic_bitset instead and implementing the
> static_bloom_filter<> template that uses std::bitset internally now.
>

Now I've decided that boost::bloom_filter<> will take a runtime
parameter which is the size of the internal bitset (a
Boost.Dynamic_bitset) -- and the static version that uses std::bitset
is boost::static_bloom_filter<>.

>> 2.2. Is it wise to use the stack-based std::bitset when M is huge? I think
>> it's not
>> unlikely that M can be very large (say millions) which can easily lead to
>> stack overflow problems. This might bring allocators back into scene.
>>
>
> You're right, it doesn't make it practical to use huge bitsets that
> sit on the stack. Perhaps Boost.Dynamic_bitset is my friend. ;-)
>

This has been addressed in revision 53785 of the Boost Sandbox. :-)

>>>>
>>>> One thing I might be amenable to is a static K and a runtime M, but
>>>> that makes the hash functions unable to determine statically the
>>>> bounds of the bitset (M). I can change the concept for the hash
>>>> function, but the simplicity of size_t(Input) just makes it easier for
>>>> users I think.
>>>>
>>
>> I'd ask you to think this over before dismissing it: I can envision
>> scenarios where M
>> is not known at compile time (for instance, when dealing with external data
>> whose
>> size is not known beforehand).
>>
>> We can easily enter into overdesign land, but you can always let this be
>> driven by
>> a policy, so that everybody's happy.
>>
>
> Now that you mention it, I agree. :-) First I'll do two things:
>
> - Make the runtime-constructed bloom_filter the default implementation
> (uses Boost.Dynamic_bitset and allows for custom allocators)

This is done.

> - Rename the current implementation as static_bloom_filter that takes
> Input, M, and a Fusion Sequence of Hash Function types as arguments.
>

Also done. :)

>>>
>>> Done, but I've run into the issue of not being able to compare
>>> boost::function objects of the same type -- for some reason this has
>>> been disabled in Boost Trunk.
>>>
>>
>> Ummm, this is tricky, because of course it does not suffice to test the
>> state only --it
>> critically depends on the hash functions to be used (At first sight I
>> thought you only
>> needed to test state and ignore auxiliary objects like std containers do).
>>
>
> Yup. But apparently, if I make the hash functions part of the type
> instead of be function objects passed in at runtime, that would solve
> the problem (based on earlier comments in the thread).
>

Which is also done. :D

>>
>> Intrusiveness is not necessarily related with having a separate
>> serialize.hpp (you can
>> friend declare inside bloom_filter in bloom_filter.hpp and have the
>> implementation
>> elsewhere), but it's a good idea nonetheless.
>>
>
> The friend declaration might be an option, and if I can't do it
> non-intrusively I'll fall back to this option. Thanks for pointing
> this out. :-)
>

And I've yet to work on this. ;-)

HTH!

-- 
Dean Michael Berris | Software Engineer, Friendster, Inc.
blog.cplusplus-soup.com | twitter.com/mikhailberis |
linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis |
deanberris.com

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk