Boost logo

Boost :

Subject: Re: [boost] Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)
From: Milosz Marian HULBOJ (mhulboj_at_[hidden])
Date: 2009-06-19 02:59:42


Dean Michael Berris wrote:
> When you say correct set of interfaces, what do you need more than
> 'insert(...)' and 'contains(...)' for the basic bloom_filter? In the
> case of a counting bloom filter, maybe you'd have 'remove(...)' or
> 'erase(...)'. Other than these functions I don't know what you mean by
> most correct set of interfaces.

For me it means the possibility of defining different variants of Bloom
Filters (standard or extensions) in an easy way. Something more than
text-book.

>> There are various (many) use-cases for a BF
>>
>> 1. user provides max size and max fpp BF determines best k
>> 2. user provides max fpp and max expected insertion BF determine best k,m
>> 3. user provides max fpp and fuzzy k with limit BF determines m,k and
>> optimal fpp
>> 4. etc etc...
>>
>> Some of these can change/vary during runtime, so getters for fpp and
>> inserted count would be nice/good/expected.
>>
>
> This is too complex -- and has too many assumptions. First, it assumes
> that the user wants a self-(re)sizing bloom filter. Second it assumes
> that the user doesn't want to provide their own hash functions. Third
> it assumes that the user actually really cares if the bloom filter
> should be smarter than the user.

1) The mentioned assumptions don't imply the dynamic-self-resizing
filter. But it would be nice to look at the recent papers about
dynamic-self-resizing papers...
2) Why does it assume that the user doesn't want custom hash functions?
3) It's not about being smarter, but about providing in convenient form
some of the theory/math behind the filters.

> In some cases (like the simplest cases I'm trying to solve for the
> moment) the current implementation seems sufficient. Of course, it's
> not fancy, and I don't intend for it to become too fancy. If you think
> you can implement something better, then please by all means show me
> code that I can deal with -- this is not said sarcastically, I really
> do want to see what you mean when you say all the above translated
> into code.

And I am implying that the simplest and naïve case is far from the real
world applications. And one should not focus on the simplest case, but
attempt to provide a good solution.

> Anyway, the point is to provide a bloom filter using the textbook
> definition. If there's a need for other kinds of bloom filters, like
> counting bloom filters, then that's another implementation that can be
> bundled in the library later -- *if* there is enough need for it, and
> *if* the library even gets into Boost just trying to solve the common
> (basic) cases. Of course the library is designed to evolve.

Yet again - the simplest case. If the library is designed to evolve, one
should take much care when designing the interfaces. It's not easy to
change afterwards. Providing counting modes or full-fledged counting
filter should not be a problem.

>> No not clear. A bloom filter is to a degree error resilient/tolerant. if for
>> some reason part of your bloom filter data is damaged (n/w io, bad memory
>> etc...) you can erase that section (set all the bits high), doing so
>> depending on how large the section is will increase the fpp. For some uses
>> that is still quite acceptable, as the fpp has already been determined to be
>> orders of magnitude less than what can be tolerated.
>>
>
> If you want to go down this road, then you think about implementing
> your own allocator with error correction support and use that in the
> bloom_filter or any container that supports/uses custom allocators.
> Maybe you'll want an STL vector that also knows how to determine if
> part of the data is corrupted too?

Please read the paper that I have mentioned in one of the earlier posts.
There are also uses of the intentional introduction of the errors into
the BF.

>> This is NOT data compression if you are providing union operations, then you
>> should also provide self compression, that is for example, OR'ing the 1st
>> half of the filter with the 2nd half, decreasing its size.
>>
>
> I don't understand why you want to do this when clearly the Bloom
> Filter is meant to have a fixed number of bits. Any variation to this
> in transit is beyond the scope of the implementation of the Bloom
> Filter.

Again, there is some motivation and research behind compression.
Tradeoff between size, time of lookup and FPR is one of the reasons.

> Let me rehash a bit:
> 4. The implementation is meant to be simple, portable across
> platforms, and C++ standard-compliant. When you deal with these
> issues, it's best you don't think about cache hit-miss and coherency
> issues IMO. At any rate if you want to have better control of the
> memory allocation and issues like these, you can control them by: a)
> using your favorite STL implementation of an std::bitset in the case
> of the statically-sized filter or b) by providing your own allocator
> and using the "correct" block type in the case of a
> runtime-constructed Bloom Filter.

And the implementation can be perfectly C++ compliant and portable with
including the variants of the Bloom filter that pay attention to the
cache hit-miss ratio and coherency. And it is not only the issue of
memory allocation itself but it has much to do with the internal working
of the filter.

> Of course, if the library under construction isn't suiting your needs,
> there should be room for improvement. But there's no need to be too
> fancy with the implementation/interface and being clever with the
> False Positive Probability -- you can control this easily with
> choosing/implementing your own hash functions and the size of the
> underlying bitset.

You can do more than choosing a custom hash and size.

Regards,
Milosz


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk