|
Boost : |
Subject: Re: [boost] Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2009-06-09 11:47:23
Hi Manuel,
On Tue, Jun 9, 2009 at 8:45 PM, Manuel
Fiorelli<manuel.fiorelli_at_[hidden]> wrote:
> About the implementation:
> If I understand your implementation, you do something like that:
> given a value V,
> you compute a number X (say the digest of V),
> then you compute K different values applying the transformation (X +
> offset) mod(M)
>
Right. This naive filter will almost always create K consecutive
indexes (or results) from the hash functions.
> I don't know if this produces k hash functions suitable for bloom
> filters, but, why don't you use Boost.hash to compute X? Also, it
> would be easier to define the requirements on the Input type.
>
I've considered using Boost.Hash to compute the hash values -- the
naive implementation up there is meant to be a sample. I'm in the
process of (re)thinking how the functions are made part of the type
instead. Maybe instead of just Boost.Array's at construction time I'll
consider Fusion-compatible sequences that are part of the signature
(thus the type of the filter).
> About the documentation:
> I think that the following paragraph in the documentation must be reformulated
>
> <snip>
> To find out whether we've encountered a certain URL before, we simply
> pass the URL to the same three hash functions. If we see that all the
> positions returned by the three hash functions are all 1, then we
> might have seen the URL before -- this is called a false positive.
> </snip>
>
> indeed, you cannot say that the URL is a false positive, since it
> could be equal to an URL stored previously.
>
Right, I need to make that a little more clear. Thanks for pointing it out.
> IMHO, you should describe the mathematical principles in more detail
> and, moreover, you should discuss in more detail the false positives
> and how to control their rate.
>
I agree. I'll work on it some more and gather more resources worth referencing.
> Even if it isn't mandatory, it would be nice if you write your
> documentation in QuickBook format, since it will uniform both the
> documentation toolchain and the documentation look.
I chose to write in ReStructuredText first to make the documentation
generation cycle faster -- it should be easier later to create
Quickbook documentation from the RST sources if I remember my
quickbook correctly. ;-)
>
> Also, don't forgive to synchronize the documentation with the current interface.
>
Right. More work for me then. ;-)
Thanks for the insights! :-D
-- Dean Michael Berris | Software Engineer, Friendster, Inc. blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk