
Subject: Re: [boost] Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2009-06-19 04:51:17


On Fri, Jun 19, 2009 at 3:18 PM, Arash Partow <arash_at_[hidden]> wrote:
> Dean Michael Berris wrote:
>>
>> When you say correct set of interfaces, what do you need more than
>> 'insert(...)' and 'contains(...)' for the basic bloom_filter? In the
>> case of a counting bloom filter, maybe you'd have 'remove(...)' or
>> 'erase(...)'. Other than these functions I don't know what you mean by
>> most correct set of interfaces.
>>
>
> Those are not the only methods, in fact there are a few constructors
> and some policies that can be provided.

Constructors, maybe. Policies I'm still a little iffy about.

I'd say I can derive policies from the existing implementation. If
another specialized bloom filter comes along, that gives me a sample of
three implementations from which to derive policies. I'm not about to
go psychic and try to predict *all* the possible ways I can split up
the bloom filter implementation to cover every possible variation of
the bloom filter -- although that would be cool if I could. ;-)

> Understanding the problem
> space is a good place to begin, seeing how BFs are used and not just
> assuming how they are used is another thing you should look into, I
> would also have a look at the boost.crc library, there's more to
> this than just an insert and contains. it could become quite a
> useful library, if it meets the needs of people.
>

But Boost.CRC isn't a container; there's no way you can compare the
bloom_filters implementation to Boost.CRC -- unless I'm missing your
point.

The problem space that I understand Bloom Filters to be addressing is
this:

1. You need to represent a set of values.
2. You want the representation to be space efficient.
3. You can tolerate false positives.
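
For concreteness, here's a deliberately tiny, self-contained sketch of
a structure with exactly those three properties -- this is *not* the
sandbox code; the toy_bloom_filter name, the fixed size, and the seeded
probes are all made up for illustration:

#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>
#include <boost/functional/hash.hpp>

// Toy illustration only: a fixed-size bit array probed by two seeded
// hash functions.
struct toy_bloom_filter {
  std::bitset<8192> bits;

  static std::size_t probe(std::string const & value, std::size_t seed) {
    std::size_t h = seed;
    boost::hash_combine(h, value);  // mix the value into the seed
    return h % 8192;                // map into the bit array
  }

  void insert(std::string const & value) {
    bits.set(probe(value, 1));
    bits.set(probe(value, 2));
  }

  // May return a false positive, never a false negative.
  bool contains(std::string const & value) const {
    return bits.test(probe(value, 1)) && bits.test(probe(value, 2));
  }
};

int main() {
  toy_bloom_filter urls;
  urls.insert("http://www.boost.org/");
  assert(urls.contains("http://www.boost.org/")); // always true
  // contains() on a value never inserted is usually false, but can
  // occasionally report a false positive.
  return 0;
}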

Now, there are *a lot* of places where you can use a data structure
like this. One that pops into my head is a means of space-efficiently
checking whether you've already encountered a value before. This works
for filesystem caches and web proxy caches, and there are also a lot of
uses in social networking. Some people use Bloom Filters merely for
representing sets of numbers, for instance when generating new "random"
identifiers.

Of course I cannot predict all the performance requirements of all
these use cases -- some would want compression when transmitting the
data over the network, some would want wicked-fast checking, some
would want to represent absolutely huge sets and keep the in-memory
bitset representation compressed, etc. Do I want to deal with some
of these special cases? Oh yes, the compressed_bloom_filter is
something I'm thinking of implementing next, and there's also the
interesting counting_bloom_filter. Do I want to deal with them all?
Probably not.

>
>
>>>
>>>>> 2. Your code doesn't trivially exit on the contains
>>>> It used to be that it did (in the version that didn't use
>>>> Boost.Fusion). I can try and optimize still in cases when it
>>>> encounters a '0' bit in between the hash function results.
>>>>
>>> this is what i see:
>>>
>>>  bool contains(const_ref input) const {
>>>     using fusion::for_each;
>>>     typedef typename base::contains_impl contains_checker;
>>>     bool found = true;
>>>     for_each(
>>>             hash_functions,
>>>             contains_checker(bit_set, input, found)
>>>             );
>>>     return found;
>>>  }
>>>
>>> unless the contract for for_each has changed (which it hasn't both in stl
>>> and fusion):
>>>
>>> 1. it doesn't exit
>>
>> Oh, but it does. Have you tried the tests? Remember this is
>> Boost.Fusion --
>>
>> http://www.boost.org/doc/libs/1_39_0/libs/fusion/doc/html/fusion/algorithm/iteration/functions/for_each.html
>> -- it iterates on the 'hash_functions' sequence which is finite.
>>
>
> "Your code doesn't trivially exit", I'll repeat it again "Your code
> doesn't trivially exit" there is no talk about the for_each not
> ending. It's just that, because you really want to push the idea of using
> metaprogramming, you've lost sight of the simplest, most elegant
> solution, which is:
>
> for(i: 0,k-1)
> if (0 == m[h[i](d)])
>  return false;
> return true;
>

I know, but have you checked my initial implementation, which was
almost exactly that:

// note: the "&& result" in the loop condition exits the loop as soon
// as a probed bit is 0
bool result = true;
for (size_t i = 0; i < K && result; ++i)
  result = result && bit_set_[hashes[i](input)];
return result;

The reason I went with Boost.Fusion is that the sequence is statically
sized and cannot be iterated through using runtime identifiers/indexes
like the above. See, how do you get the ith element of a tuple when i
is only known at runtime? Do you know how to do this with
Boost.Fusion?
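
To illustrate (a throwaway snippet, not the library code): Fusion
element access takes the index as a compile-time constant, which is why
a plain indexed loop doesn't apply here.

#include <boost/fusion/include/vector.hpp>
#include <boost/fusion/include/at_c.hpp>

int main() {
  boost::fusion::vector<int, double, char> t(1, 2.5, 'x');

  // Fine: the index is a compile-time constant.
  double d = boost::fusion::at_c<1>(t);
  (void)d;

  // Does NOT compile: 'i' is a runtime value, but at_c<> needs a
  // constant expression -- hence for_each instead of an indexed loop.
  // for (int i = 0; i < 3; ++i) { boost::fusion::at_c<i>(t); }

  return 0;
}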

See, I'm still looking for a way to make the current Boost.Fusion-based
implementation short-circuit the loop. This means changing the function
object used in the for_each to include a check for whether it has
already encountered a 0 bit, so that it doesn't have to evaluate the
hash function again. Because Boost.Fusion already unrolls the loop for
me at compile time, I'm hoping the compiler will create the necessary
branches that make it equivalent to the above runtime loop.
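
Roughly, what I have in mind is something like the sketch below (the
storage and hash types here are simplified stand-ins, not the actual
sandbox types): the function object remembers whether a 0 bit has been
seen and skips the remaining hash evaluations once it has.

#include <bitset>
#include <cstddef>
#include <boost/fusion/include/vector.hpp>
#include <boost/fusion/include/for_each.hpp>

typedef std::bitset<256> bitset_type;  // stand-in for the real storage

// Two stand-in hash functors; the real filter's are supplied by the user.
struct hash_a { std::size_t operator()(int x) const { return (x * 2654435761UL) % 256; } };
struct hash_b { std::size_t operator()(int x) const { return (x * 40503UL + 7) % 256; } };

// Short-circuiting checker: once 'found' has gone false, the remaining
// hash functions in the Fusion sequence are never evaluated.
struct short_circuit_checker {
  short_circuit_checker(bitset_type const & bits, int const & input, bool & found)
    : bits_(bits), input_(input), found_(found) {}

  template <typename Hash>
  void operator()(Hash const & hash) const {
    if (!found_) return;               // trivial exit: skip the hash call
    found_ = bits_.test(hash(input_));
  }

  bitset_type const & bits_;
  int const & input_;
  bool & found_;
};

bool contains(bitset_type const & bits,
              boost::fusion::vector<hash_a, hash_b> const & hashes,
              int input) {
  bool found = true;
  boost::fusion::for_each(hashes, short_circuit_checker(bits, input, found));
  return found;
}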

Now, if I hadn't made the function objects part of the type as a
Boost.Fusion sequence, then I'd run into the equality-checking and
swap/assignment problem. Been there, done that, and the solution above
has made the implementation of 'contains' a little more complicated.
This is why I feel frustrated at the moment, since this has already
been discussed on the list before.

I know the burden is on me to get creative with how to address this
issue now. I welcome the challenge, and I assure you I'm working on
it. :-)

>
>>> 2. if you even bother to look at the machine code generated, you'd see
>>> that
>>> even with the best available set of opt flags,
>>> its still miles behind a simple loop with a break statement. You have a
>>> copy
>>> constructor being called, do you know the cost that will incur when
>>> calling
>>> contains 10^9 times?!
>>>
>>
>> Really? The Boost.Fusion version actually unrolls the loop for you, so
>> the cost of that for_each call is actually constant time. Sure,
>> there's no short-circuit (yet) but then the cost is constant.
>>
>
> If you actually ran your own code, you would realize that
> unrolling a loop of calls to contains is far more beneficial than
> unrolling the loop inside contains (also true for insert).

Wait, but inside contains, you get the benefit without paying too much
of a cost. Boost.Fusion already unrolls the for_each for you, so I
don't see what the problem is. I always thought unrolling/optimizing
inner loops gave good results, and if you layer that on top of
unrolled outer loops then you've essentially eliminated the loops too.
I don't see why unrolling the inner loop (Boost.Fusion's for_each) is
a bad thing.

> Further
> more if iteration over the data to be hashed was mixed in with the
> hashing operations (eg: update hash_0-hash_k-1 with data_i) that would
> also be far more efficient than  "The Boost.Fusion version actually
> unrolls the loop for you" philosophy
>

But this is up to the hash function implementation, not 'contains'.
See, if Boost.Fusion has already unrolled the loop in 'contains' and
the hash function implementation is already sufficiently optimized
internally, I don't see what the problem is.

BTW, if you check what the compiler actually generated, you'll see that
Boost.Fusion is dealing with references (not copies), even for the
function object being applied, that the unrolled loop is there, and
that all that remains are the (almost always inlined) calls to the hash
functions.

>>
>> This is too complex -- and has too many assumptions. First, it assumes
>> that the user wants a self-(re)sizing bloom filter. Second it assumes
>> that the user doesn't want to provide their own hash functions. Third
>> it assumes that the user actually really cares if the bloom filter
>> should be smarter than the user.
>>
>> In some cases (like the simplest cases I'm trying to solve for the
>> moment) the current implementation seems sufficient. Of course, it's
>> not fancy, and I don't intend for it to become too fancy. If you think
>> you can implement something better, then please by all means show me
>> code that I can deal with -- this is not said sarcastically, I really
>> do want to see what you mean when you say all the above translated
>> into code.
>>
>
> What I read from this is: I'm trying to propose and eventually
> submit a library for which I don't fully understand its most basic
> use-cases, furthermore I can't be bothered to look into it any
> further without someone feeding me the answers - good attitude :)
>

No, what I was communicating was that the bloom filter implementation
should be simple enough to fit simple cases, and good enough to cover
cases with a little more complexity. Now, if you need anything specific
that's not yet covered by the currently implemented bloom filters, then
maybe those are different types that still need to be implemented. I
personally may not be able to implement all of these before (if ever)
this gets reviewed and into the main Boost distribution, but on the off
chance that you do want to actually contribute code, I am open to
suggestions and collaboration.

Now, I don't see what's wrong with proposing and submitting a simple
library _that works_ and iterating according to the requirements of the
community. So far I see only one person asking for a set of highly
complex, specialized bloom filters, and I'm questioning whether it's
really a good use of my time to chase corner cases for (currently)
unimplemented bloom filters. ;-)

>>
>> I thought 'counting modes' meant 'Mode' in the statistical sense and
>> getting these statistical modes.
>>
>> I would have understood it better if you meant I wasn't implementing
>> counting bloom filters or bloomier filters or <insert new variation of
>> a highfalutin name> bloom filter.
>>
>> Anyway, the point is to provide a bloom filter using the textbook
>> definition. If there's a need for other kinds of bloom filters, like
>> counting bloom filters, then that's another implementation that can be
>> bundled in the library later -- *if* there is enough need for it, and
>> *if* the library even gets into Boost just trying to solve the common
>> (basic) cases. Of course the library is designed to evolve.
>>
>
> A smart representation of the internal data and some policies could
> cater for both modes without having to create an xxx_bf class for
> each new mode. Again as mentioned above requires understanding the
> problem domain and not just 5mins on google.
>

The smart representation is already there for the basic use case (a
Boost.Dynamic_bitset and an STL bitset). Now, in the case of a
compressed internal representation for the compressed_bloom_filter
suggestion, that has yet to be implemented. The counting_bloom_filter
is also a different thing.

Notice that they're different types? Remember, you derive policies,
not predict them. I think the implementation can improve and evolve,
but I don't see why you're looking at the current version and assuming
I'm not going to change things along the way... Have you assumed that
I've stopped thinking about this project already?

>
>
>>>>> 5. Your solution doesn't provide for erasures (do not confuse with
>>>>> removal
>>>>> of inserted items)
>>>> .clear() is supposed to clear the internal bitset and clear the bloom
>>>> filter.
>>>>
>>> No not clear. A bloom filter is to a degree error resilient/tolerant. if
>>> for
>>> some reason part of your bloom filter data is damaged (n/w io, bad memory
>>> etc...) you can erase that section (set all the bits high), doing so
>>> depending on how large the section is will increase the fpp. For some
>>> uses
>>> that is still quite acceptable, as the fpp has already been determined to
>>> be
>>> orders of magnitude less than what can be tolerated.
>>>
>>
>> If you want to go down this road, then you think about implementing
>> your own allocator with error correction support and use that in the
>> bloom_filter or any container that supports/uses custom allocators.
>> Maybe you'll want an STL vector that also knows how to determine if
>> part of the data is corrupted too?
>>
>
> An allocator allocates/deallocates memory thats all it does, nothing
> else, this is clearly a job for the class holding the data. Again
> I'm not going to feed you the answer as to why this is done as a
> person interested in proposing a library you should at the very
> least know about it and why its done.
>

If you had a use case where you actually expected a container to heal
itself after memory corruption, then maybe I'd understand better what
you mean. As I've already responded in another email, I'm thinking this
should be a special case that doesn't need to be part of a generic
solution.

>>>
>>
>> I don't understand why you want to do this when clearly the Bloom
>> Filter is meant to have a fixed number of bits. Any variation to this
>> in transit is beyond the scope of the implementation of the Bloom
>> Filter.
>>
>
> Again if you can't see why decreasing the size of the data gradually
> has a benefit then.....
>

Sure, you want a policy that defines the internal storage, so you can
define a policy for compressed bloom filters. I get it. But how's that
different from implementing a compressed_bloom_filter?

>>
>> I guess you missed the earlier discussion: the point was that since
>> the hash functions are part of the bloom_filter's type, it's assumed
>> that two bloom_filter instances of the same type have the same hash
>> functions. If one bloom filter has a different encapsulated state
>> compared to another bloom filter of the same type, then they are not
>> equal -- simple, and easy to understand/implement. That's the point of
>> equality testing after all.
>>
>
> Ok my bad probably didn't catch that, but to be fair all i did was
> look at the code you've presented. Of all the information available
> on this library that is the most pertinent, not the back and forths
> on an ML
>

Actually if you read the code carefully, that's already the case even
if you didn't read the discussions on the list.

>>
>> Hmmm... I guess you haven't followed the discussion at all in the list.
>
> No you're right, all I need is to see code where you try and push the use of
> complex template algos with in-line copy constructors when all you
> need is a for-loop an index counter and a break. I apologize for not being
> enlightened by the discussion on the ML.
>

What in-line copy constructors are you talking about? I make
absolutely no copies in the code whatsoever when it comes to using
Boost.Fusion's for_each. Are you sure you know how Boost.Fusion works?
Because you seem afraid of the idioms used in the implementation.

>
>> Let me rehash a bit:
>>
>> Of course, if the library under construction isn't suiting your needs,
>> there should be room for improvement. But there's no need to be too
>> fancy with the implementation/interface and being clever with the
>> False Positive Probability
>
> If you actually use BFs you'll soon realize you need all these
> things. Otherwise all that it is, is an array with some hash
> functions. There is nothing wrong with wanting the bare minimum and
> most correct interface, isn't that one of the guiding principles of
> boost?
>

Sure, that's why I'm open to compressed_bloom_filter and
counting_bloom_filter and other bloom filter implementations
(bloomier_filters come to mind). But what I'm saying is that, pending
these implementations, I'm sticking with the implementation that's
currently up there for the bare-minimum use case, which it suits just
fine. There's still room for improvement, as you've pointed out in the
contains() function, and I'll look into addressing the trivial-exit
problem.

> Look I've been a bit brash, I apologize. I know you're keen to make a
> contribution to boost and get your name up in lights, you've got that
> function input iter thing (whatever that is) pending. Its always
> nice to see people eager and keen contribute to open source stuff, so
> don't let me or my whims discourage you. :D
>

It has nothing to do with getting my name up in lights. I guess I was
just looking for clearer explanations/thoughts than one-liners and
vague statements. Hopefully, if you can be more verbose about what you
mean rather than just silently implying things, this medium will lend
itself better to discussion. Otherwise it's really hard to understand
what you meant with the earlier list of one-liners, and the only way I
knew to get more information from you was to engage you in discussion.

Don't worry, they're more encouraging than discouraging really. Now
that just means there's more work for me. :-)

-- 
Dean Michael Berris | Software Engineer, Friendster, Inc.
blog.cplusplus-soup.com | twitter.com/mikhailberis |
linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis |
deanberris.com
