Boost logo

Boost :

Subject: Re: [boost] Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2009-06-09 11:47:23


Hi Manuel,

On Tue, Jun 9, 2009 at 8:45 PM, Manuel
Fiorelli<manuel.fiorelli_at_[hidden]> wrote:
> About the implementation:
> If I understand your implementation, you do something like that:
> given a value V,
> you compute a number X (say the digest of V),
> then you compute K different values applying the transformation (X +
> offset) mod(M)
>

Right. This naive filter will almost always create K consecutive
indexes (or results) from the hash functions.

> I don't know if this produces k hash functions suitable for bloom
> filters, but, why don't you use Boost.hash to compute X? Also, it
> would be easier to define the requirements on the Input type.
>

I've considered using Boost.Hash to compute the hash values -- the
naive implementation up there is meant to be a sample. I'm in the
process of (re)thinking how the functions are made part of the type
instead. Maybe instead of just Boost.Array's at construction time I'll
consider Fusion-compatible sequences that are part of the signature
(thus the type of the filter).

> About the documentation:
> I think that the following paragraph in the documentation must be reformulated
>
> <snip>
> To find out whether we've encountered a certain URL before, we simply
> pass the URL to the same three hash functions. If we see that all the
> positions returned by the three hash functions are all 1, then we
> might have seen the URL before -- this is called a false positive.
> </snip>
>
> indeed, you cannot say that the URL is a false positive, since it
> could be equal to an URL stored previously.
>

Right, I need to make that a little more clear. Thanks for pointing it out.

> IMHO, you should describe the mathematical principles in more detail
> and, moreover, you should discuss in more detail the false positives
> and how to control their rate.
>

I agree. I'll work on it some more and gather more resources worth referencing.

> Even if it isn't mandatory, it would be nice if you write your
> documentation in QuickBook format, since it will uniform both the
> documentation toolchain and the documentation look.

I chose to write in ReStructuredText first to make the documentation
generation cycle faster -- it should be easier later to create
Quickbook documentation from the RST sources if I remember my
quickbook correctly. ;-)

>
> Also, don't forgive to synchronize the documentation with the current interface.
>

Right. More work for me then. ;-)

Thanks for the insights! :-D

-- 
Dean Michael Berris | Software Engineer, Friendster, Inc.
blog.cplusplus-soup.com | twitter.com/mikhailberis |
linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis |
deanberris.com

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk