Boost logo

Boost :

Subject: Re: [boost] Interest in a Bloom Filter?
From: Manuel Fiorelli (manuel.fiorelli_at_[hidden])
Date: 2009-06-16 03:31:23


2009/6/15 Milosz Marian HULBOJ <mhulboj_at_[hidden]>:
> Dean Michael Berris wrote:
>>
>> Hi Milosz,
>>
>> On Mon, Jun 15, 2009 at 2:49 PM, Milosz Marian HULBOJ<mhulboj_at_[hidden]>
>> wrote:
>>>
>>> - It would be nice to have the option of using `double-hashing' schema,
>>> where all the hashes are generated according to:
>>> hash(i) = h1(x) + i*h2(x)
>>> or the squared/cubed variant:
>>> hash(i) = h1(x) + i*h2(x) + i*i (*i)
>>>
>>
>> This is interesting. I'd think this would have to be implemented by
>> the user of the bloom_filter in the HashFunctions template parameter.
>>
>>> - one hash value can be potentially used to generate many hash values.
>>> For
>>> example, if you need 16-bit hashes, then from the hash function producing
>>> 32-bit hash-value you could get 2 values. This can be in turn combined
>>> with
>>> the previous point - double hashing.
>
> Hi Dean,
>
> The bottom line of the two points above is that in order to hash for the
> bloom filter (in the double-hashing schema) one only needs to:
> - invoke hash function once
> - split the bits and do few multiplications and additions
> Still I didn't have time to look into yours implementation to see whether
> the above can be used seamlessly.
>
> There are also variants of double hashing to be more cache-line friendly,
> but personally the idea presented above was sufficient for my needs (for the
> hashing functions I've been using the TR1 hash and murmur hash).

I suspect that this approach cannot be implemented with the current
design, because of the
management of the hash functions.

The template class bloom_filter takes a parameter, called
HashFunctions, which describes
the K independent hash functions to instantiate and use.

There is no way to express that the i-th hash depends on the previous
hashes other than recompute them inside the i-th hash function.

I suspect that the interface could be changed to be like the following

template<class Input, class HashingPolicy, class Hasher>
class bloom_filter;

The Hashing policy must compute the K hash functions, using the value produced
by the hash function described by the hasher.

The Hashing policy can use the Hasher many times and can split the
values as you describe.

If we want to accomodate an HashingPolicy using M different hash
functions, Hasher could be
a fusion vector, possibly with less elements than K.

PS: I don't know if this less elegant design could really determine
any advantage, or even if it is buggy :-)

Manuel Fiorelli


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk