Boost logo

Boost :

Subject: Re: [boost] Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)
From: Arash Partow (arash_at_[hidden])
Date: 2009-06-18 16:59:45


Dean Michael Berris wrote:
> Hi Arash,
>
> On Thu, Jun 18, 2009 at 7:00 AM, Arash Partow wrote:
>> 1. Can you please explain why you need multiple "hashes"?
>
> You don't need multiple hashes, but if you do want to use more hash
> functions (K) you can. The point is the user should be able to provide
> the hash functions that the bloom filter will use internally.
>

Please read something other than wiki, there are many great citations on the wiki article that explain the most practical state of the art BFs. If you want something for boost then at least try and provide the most correct set of interfaces even if the underlying details are not the most correct.
In short you only need 2 independent hash functions to get the optimal fpp for a set of size m bits(an expected insertion amount is optional).

>> 2. Your code doesn't trivially exit on the contains
>
> It used to be that it did (in the version that didn't use
> Boost.Fusion). I can try and optimize still in cases when it
> encounters a '0' bit in between the hash function results.
>

this is what i see:

  bool contains(const_ref input) const {
      using fusion::for_each;
      typedef typename base::contains_impl contains_checker;
      bool found = true;
      for_each(
              hash_functions,
              contains_checker(bit_set, input, found)
              );
      return found;
  }

unless the contract for for_each has changed (which it hasn't both in stl and fusion):

1. it doesn't exit
2. if you even bother to look at the machine code generated, you'd see that even with the best available set of opt flags,
its still miles behind a simple loop with a break statement. You have a copy constructor being called, do you know the cost that will incur when calling contains 10^9 times?!

>> 3. Your solution doesn't provide a means to determine the various optimal
>> parameter combinations
>
> What do you mean by this?
>

There are various (many) use-cases for a BF

1. user provides max size and max fpp BF determines best k
2. user provides max fpp and max expected insertion BF determine best k,m
3. user provides max fpp and fuzzy k with limit BF determines m,k and optimal fpp
4. etc etc...

Some of these can change/vary during runtime, so getters for fpp and inserted count would be nice/good/expected.

>> 4. Your solution doesn't provide a means for counting modes
>
> Modes for bloom filters? I'm not sure I understand what you mean.
>

There are modes of BFs where instead of 1 bit per block you use more say 4 bits and instead of setting high you increment a counter.
this allows for a probabilistic determination of how many times an element has been inserted into a BF, this is called a counting bloom filter.
this is another of MANY important use-case you have not covered.

>> 5. Your solution doesn't provide for erasures (do not confuse with removal
>> of inserted items)
>
> .clear() is supposed to clear the internal bitset and clear the bloom filter.
>

No not clear. A bloom filter is to a degree error resilient/tolerant. if for some reason part of your bloom filter data is damaged (n/w io, bad memory etc...) you can erase that section (set all the bits high), doing so depending on how large the section is will increase the fpp. For some uses that is still quite acceptable, as the fpp has already been determined to be orders of magnitude less than what can be tolerated.

>> 6. Your solution doesn't provide for self superimposed compression mode
>> (similar to union but doubling of fpp)
>>
>
> The idea of the bloom_filter is to provide precisely that, a bloom
> filter -- any compression you might want to do in transmitting a
> bloom_filter would IMO be beyond the scope of the library. There has
> been some suggestions of implementing a compressed_bloom_filter that
> maintains a compressed bitset, but afar from that I must say I have no
> idea what self superimposed compression mode.
>

This is NOT data compression if you are providing union operations, then you should also provide self compression, that is for example, OR'ing the 1st half of the filter with the 2nd half, decreasing its size.

Also another note looking at your code again (btw do you even use it?!), you're equality test should also include the hashes and any other seeding data you use to construct the BF.

Another thing that is missing is serialization of the BF, I would like to be able to save to file, or send it over a n/w. So a streaming operator or something inline with boost.serialize would be good.

Another thing to consider is that a BF implementation should not be a means to show-off one's abilities in the use of templates and meta-programming paradigms but rather something that is efficient and lean. Think about situations where cache hit-miss and coherency will be an issue (as these are some of the more critical uses of BFs), look at the machine code that is generated for some of the more critical routines insert/contains for various archs and ask yourself, what it is you're "really" trying to achieve.

Arash Partow
__________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk