
Subject: Re: [boost] Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)
From: Arash Partow (arash_at_[hidden])
Date: 2009-06-19 03:18:39


Dean Michael Berris wrote:
> On Fri, Jun 19, 2009 at 4:59 AM, Arash Partow wrote:
>> Dean Michael Berris wrote:
>>> Hi Arash,
>>>
>>> On Thu, Jun 18, 2009 at 7:00 AM, Arash Partow wrote:
>>>> 1. Can you please explain why you need multiple "hashes"?
>>> You don't need multiple hashes, but if you do want to use more hash
>>> functions (K) you can. The point is the user should be able to provide
>>> the hash functions that the bloom filter will use internally.
>>>
>> Please read something other than wiki, there are many great citations on the
>> wiki article that explain the most practical state of the art BFs. If you
>> want something for boost then at least try and provide the most correct set
>> of interfaces even if the underlying details are not the most correct.
>> In short you only need 2 independent hash functions to get the optimal fpp
>> for a set of size m bits (an expected insertion amount is optional).
>>
>
> When you say correct set of interfaces, what do you need more than
> 'insert(...)' and 'contains(...)' for the basic bloom_filter? In the
> case of a counting bloom filter, maybe you'd have 'remove(...)' or
> 'erase(...)'. Other than these functions I don't know what you mean by
> most correct set of interfaces.
>

Those are not the only methods; in fact there are a few constructors
and some policies that can be provided. Understanding the problem
space is a good place to begin. Seeing how BFs are actually used,
rather than assuming how they are used, is another thing you should
look into. I would also have a look at the Boost.CRC library; there's
more to this than just an insert and a contains. It could become
quite a useful library if it meets the needs of people.

>>
>>>> 2. Your code doesn't trivially exit on the contains
>>> It used to be that it did (in the version that didn't use
>>> Boost.Fusion). I can try and optimize still in cases when it
>>> encounters a '0' bit in between the hash function results.
>>>
>> this is what i see:
>>
>> bool contains(const_ref input) const {
>>    using fusion::for_each;
>>    typedef typename base::contains_impl contains_checker;
>>    bool found = true;
>>    for_each(
>>       hash_functions,
>>       contains_checker(bit_set, input, found)
>>    );
>>    return found;
>> }
>>
>> unless the contract for for_each has changed (which it hasn't both in stl
>> and fusion):
>>
>> 1. it doesn't exit
>
> Oh, but it does. Have you tried the tests? Remember this is
> Boost.Fusion --
> http://www.boost.org/doc/libs/1_39_0/libs/fusion/doc/html/fusion/algorithm/iteration/functions/for_each.html
> -- it iterates on the 'hash_functions' sequence which is finite.
>

"Your code doesn't trivially exit" -- I'll repeat it again: "Your code
doesn't trivially exit." There is no claim that the for_each never
ends. It's just that, because you really want to push the idea of
using metaprogramming, you've lost sight of the simplest, most
elegant solution, which is:

for (i = 0; i < k; ++i)
   if (0 == m[h[i](d)])
      return false;
return true;

>> 2. if you even bother to look at the machine code generated, you'd see that
>> even with the best available set of opt flags,
>> its still miles behind a simple loop with a break statement. You have a copy
>> constructor being called, do you know the cost that will incur when calling
>> contains 10^9 times?!
>>
>
> Really? The Boost.Fusion version actually unrolls the loop for you, so
> the cost of that for_each call is actually constant time. Sure,
> there's no short-circuit (yet) but then the cost is constant.
>

If you actually ran your own code, you would realize that unrolling a
loop of calls to contains is far more beneficial than unrolling the
loop inside contains (also true for insert). Furthermore, if iteration
over the data to be hashed were mixed in with the hashing operations
(e.g.: update hash_0..hash_k-1 with data_i), that would also be far
more efficient than the "The Boost.Fusion version actually unrolls the
loop for you" philosophy.

>>>> 3. Your solution doesn't provide a means to determine the various optimal
>>>> parameter combinations
>>> What do you mean by this?
>>>
>> There are various (many) use-cases for a BF
>>
>> 1. user provides max size and max fpp BF determines best k
>> 2. user provides max fpp and max expected insertion BF determine best k,m
>> 3. user provides max fpp and fuzzy k with limit BF determines m,k and
>> optimal fpp
>> 4. etc etc...
>>
>> Some of these can change/vary during runtime, so getters for fpp and
>> inserted count would be nice/good/expected.
>>
>
> This is too complex -- and has too many assumptions. First, it assumes
> that the user wants a self-(re)sizing bloom filter. Second it assumes
> that the user doesn't want to provide their own hash functions. Third
> it assumes that the user actually really cares if the bloom filter
> should be smarter than the user.
>
> In some cases (like the simplest cases I'm trying to solve for the
> moment) the current implementation seems sufficient. Of course, it's
> not fancy, and I don't intend for it to become too fancy. If you think
> you can implement something better, then please by all means show me
> code that I can deal with -- this is not said sarcastically, I really
> do want to see what you mean when you say all the above translated
> into code.
>

What I read from this is: "I'm trying to propose and eventually
submit a library for which I don't fully understand the most basic
use-cases; furthermore, I can't be bothered to look into it any
further without someone feeding me the answers." Good attitude :)

>>>> 4. Your solution doesn't provide a means for counting modes
>>> Modes for bloom filters? I'm not sure I understand what you mean.
>>>
>> There are modes of BFs where instead of 1 bit per block you use more say 4
>> bits and instead of setting high you increment a counter.
>> this allows for a probabilistic determination of how many times an element
>> has been inserted into a BF, this is called a counting bloom filter.
>> this is another of MANY important use-case you have not covered.
>>
>
> I thought 'counting modes' meant 'Mode' in the statistical sense and
> getting these statistical modes.
>
> I would have understood it better if you meant I wasn't implementing
> counting bloom filters or bloomier filters or <insert new variation of
> a highfalutin name> bloom filter.
>
> Anyway, the point is to provide a bloom filter using the textbook
> definition. If there's a need for other kinds of bloom filters, like
> counting bloom filters, then that's another implementation that can be
> bundled in the library later -- *if* there is enough need for it, and
> *if* the library even gets into Boost just trying to solve the common
> (basic) cases. Of course the library is designed to evolve.
>

A smart representation of the internal data and some policies could
cater for both modes without having to create an xxx_bf class for
each new mode. Again, as mentioned above, this requires understanding
the problem domain, not just 5 mins on Google.

>>>> 5. Your solution doesn't provide for erasures (do not confuse with
>>>> removal
>>>> of inserted items)
>>> .clear() is supposed to clear the internal bitset and clear the bloom
>>> filter.
>>>
>> No not clear. A bloom filter is to a degree error resilient/tolerant. if for
>> some reason part of your bloom filter data is damaged (n/w io, bad memory
>> etc...) you can erase that section (set all the bits high), doing so
>> depending on how large the section is will increase the fpp. For some uses
>> that is still quite acceptable, as the fpp has already been determined to be
>> orders of magnitude less than what can be tolerated.
>>
>
> If you want to go down this road, then you think about implementing
> your own allocator with error correction support and use that in the
> bloom_filter or any container that supports/uses custom allocators.
> Maybe you'll want an STL vector that also knows how to determine if
> part of the data is corrupted too?
>

An allocator allocates/deallocates memory; that's all it does, nothing
else. This is clearly a job for the class holding the data. Again,
I'm not going to feed you the answer as to why this is done; as a
person interested in proposing a library, you should at the very
least know about it and why it's done.

>>
>>>> 6. Your solution doesn't provide for self superimposed compression mode
>>>> (similar to union but doubling of fpp)
>>>>
>>> The idea of the bloom_filter is to provide precisely that, a bloom
>>> filter -- any compression you might want to do in transmitting a
>>> bloom_filter would IMO be beyond the scope of the library. There has
>>> been some suggestions of implementing a compressed_bloom_filter that
>>> maintains a compressed bitset, but afar from that I must say I have no
>>> idea what self superimposed compression mode.
>>>
>> This is NOT data compression if you are providing union operations, then you
>> should also provide self compression, that is for example, OR'ing the 1st
>> half of the filter with the 2nd half, decreasing its size.
>>
>
> I don't understand why you want to do this when clearly the Bloom
> Filter is meant to have a fixed number of bits. Any variation to this
> in transit is beyond the scope of the implementation of the Bloom
> Filter.
>

Again, if you can't see why gradually decreasing the size of the data
has a benefit, then.....

>> Also another note looking at your code again (btw do you even use it?!),
>> your equality test should also include the hashes and any other seeding
>> data you use to construct the BF.
>>
>
> I guess you missed the earlier discussion: the point was that since
> the hash functions are part of the bloom_filter's type, it's assumed
> that two bloom_filter instances of the same type have the same hash
> functions. If one bloom filter has a different encapsulated state
> compared to another bloom filter of the same type, then they are not
> equal -- simple, and easy to understand/implement. That's the point of
> equality testing after all.
>

OK, my bad, I probably didn't catch that. But to be fair, all I did
was look at the code you've presented. Of all the information
available on this library, that is the most pertinent, not the back
and forths on an ML.

>> Another thing that is missing is serialization of the BF, I would like to be
>> able to save to file, or send it over a n/w. So a streaming operator or
>> something inline with boost.serialize would be good.
>>
>
> Yes, I know. And if you missed the earlier discussion, I'm working on it. ;-)
>

ok.

>> Another thing to consider is that a BF implementation should not be a means
>> to show-off one's abilities in the use of templates and meta-programming
>> paradigms but rather something that is efficient and lean. Think about
>> situations where cache hit-miss and coherency will be an issue (as these are
>> some of the more critical uses of BFs), look at the machine code that is
>> generated for some of the more critical routines insert/contains for various
>> archs and ask yourself, what it is you're "really" trying to achieve.
>>
>
> Hmmm... I guess you haven't followed the discussion at all in the list.

No, you're right, all I need is to see code where you try and push the
use of complex template algos with inlined copy constructors when all
you need is a for-loop, an index counter and a break. I apologize for
not being enlightened by the discussion on the ML.

> Let me rehash a bit:
>
> Of course, if the library under construction isn't suiting your needs,
> there should be room for improvement. But there's no need to be too
> fancy with the implementation/interface and being clever with the
> False Positive Probability

If you actually use BFs you'll soon realize you need all these
things. Otherwise all it is, is an array with some hash functions.
There is nothing wrong with wanting the bare minimum and most correct
interface; isn't that one of the guiding principles of Boost?

Look, I've been a bit brash, and I apologize. I know you're keen to
make a contribution to Boost and get your name up in lights; you've
got that function input iterator thing (whatever that is) pending.
It's always nice to see people eager and keen to contribute to open
source stuff, so don't let me or my whims discourage you. :D

Arash Partow
__________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk