
Subject: Re: [boost] Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)
From: Arash Partow (arash_at_[hidden])
Date: 2009-06-19 03:25:28


Milosz, I completely agree with everything you have said. In the event Dean decides it's a bit much for him, I wouldn't mind teaming up and doing it properly.

Arash Partow
__________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net

Milosz Marian HULBOJ wrote:
> Dean Michael Berris wrote:
>> When you say correct set of interfaces, what do you need more than
>> 'insert(...)' and 'contains(...)' for the basic bloom_filter? In the
>> case of a counting bloom filter, maybe you'd have 'remove(...)' or
>> 'erase(...)'. Other than these functions I don't know what you mean
>> by the most correct set of interfaces.
>
> For me it means the possibility of defining different variants of
> Bloom filters (standard or extensions) in an easy way. Something more
> than the text-book version.
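
To keep things concrete: the text-book shape being debated here is tiny.
A minimal sketch (hypothetical names and a toy two-probe hashing scheme
via Boost.Hash, not the Sandbox API) would be something like:

  #include <bitset>
  #include <cstddef>
  #include <boost/functional/hash.hpp>

  template <typename T, std::size_t M>
  class bloom_filter
  {
  public:
     void insert(const T& value)
     {
        bits_.set(hash(value, 1) % M);
        bits_.set(hash(value, 2) % M);
     }

     bool contains(const T& value) const
     {
        return bits_.test(hash(value, 1) % M) &&
               bits_.test(hash(value, 2) % M);
     }

  private:
     // seed the hash differently for each probe; only two probes here,
     // purely for brevity
     static std::size_t hash(const T& value, std::size_t seed)
     {
        boost::hash_combine(seed, value);
        return seed;
     }

     std::bitset<M> bits_;
  };

The argument is precisely over everything this sketch leaves out.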
>
>>> There are various (many) use-cases for a BF:
>>>
>>> 1. user provides max size and max fpp; BF determines best k
>>> 2. user provides max fpp and max expected insertions; BF determines
>>> best k, m
>>> 3. user provides max fpp and fuzzy k with limit; BF determines m, k
>>> and optimal fpp
>>> 4. etc. etc.
>>>
>>> Some of these can change/vary during runtime, so getters for fpp and
>>> inserted count would be nice/good/expected.
>>>
>>
>> This is too complex -- and has too many assumptions. First, it assumes
>> that the user wants a self-(re)sizing bloom filter. Second, it assumes
>> that the user doesn't want to provide their own hash functions. Third,
>> it assumes that the user actually cares whether the bloom filter
>> should be smarter than the user.
>
> 1) The mentioned assumptions don't imply a dynamic self-resizing
> filter. But it would be nice to look at the recent papers on
> dynamic self-resizing filters...
> 2) Why does it assume that the user doesn't want custom hash functions?
> 3) It's not about being smarter, but about providing some of the
> theory/math behind the filters in a convenient form.
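
For reference, the text-book math behind use case 2 above (n expected
insertions, target false positive probability p) is easy to expose; a
sketch with illustrative names only:

  #include <cmath>
  #include <cstddef>

  struct bloom_parameters
  {
     std::size_t m; // number of bits
     std::size_t k; // number of hash functions
  };

  // m = ceil(-n * ln(p) / (ln 2)^2), k = round((m / n) * ln 2)
  bloom_parameters optimal_parameters(std::size_t n, double p)
  {
     const double ln2 = std::log(2.0);
     bloom_parameters result;
     result.m = static_cast<std::size_t>(
        std::ceil(-(static_cast<double>(n) * std::log(p)) / (ln2 * ln2)));
     result.k = static_cast<std::size_t>(
        (result.m / static_cast<double>(n)) * ln2 + 0.5);
     if (result.k == 0) result.k = 1;
     return result;
  }

For example, optimal_parameters(1000000, 0.01) comes out to roughly 9.6
million bits with k = 7.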
>
>> In some cases (like the simplest cases I'm trying to solve for the
>> moment) the current implementation seems sufficient. Of course, it's
>> not fancy, and I don't intend for it to become too fancy. If you think
>> you can implement something better, then please by all means show me
>> code that I can deal with -- this is not said sarcastically, I really
>> do want to see what you mean when you say all the above translated
>> into code.
>
> And I am saying that the simplest, naïve case is far from real-world
> applications. One should not focus on the simplest case, but should
> attempt to provide a good solution.
>
>
>> Anyway, the point is to provide a bloom filter using the textbook
>> definition. If there's a need for other kinds of bloom filters, like
>> counting bloom filters, then that's another implementation that can be
>> bundled in the library later -- *if* there is enough need for it, and
>> *if* the library even gets into Boost just trying to solve the common
>> (basic) cases. Of course the library is designed to evolve.
>
> Yet again - the simplest case. If the library is designed to evolve,
> one should take great care when designing the interfaces; they are not
> easy to change afterwards. Providing counting modes or a full-fledged
> counting filter should not be a problem.
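
And indeed, a counting mode is a small step from the basic filter. A
minimal sketch, assuming saturating byte counters and Boost.Hash (the
names are mine, not a proposed interface):

  #include <vector>
  #include <cstddef>
  #include <boost/functional/hash.hpp>

  template <typename T>
  class counting_bloom_filter
  {
  public:
     counting_bloom_filter(std::size_t m, std::size_t k)
     : counters_(m, 0), k_(k) {}

     void insert(const T& value)
     {
        for (std::size_t i = 0; i < k_; ++i)
        {
           unsigned char& c = counters_[index(value, i)];
           if (c != 255) ++c; // saturate instead of wrapping around
        }
     }

     void erase(const T& value)
     {
        for (std::size_t i = 0; i < k_; ++i)
        {
           unsigned char& c = counters_[index(value, i)];
           if (c != 0 && c != 255) --c; // never touch a saturated cell
        }
     }

     bool contains(const T& value) const
     {
        for (std::size_t i = 0; i < k_; ++i)
           if (counters_[index(value, i)] == 0)
              return false;
        return true;
     }

  private:
     std::size_t index(const T& value, std::size_t i) const
     {
        std::size_t seed = i;
        boost::hash_combine(seed, value);
        return seed % counters_.size();
     }

     std::vector<unsigned char> counters_;
     std::size_t k_;
  };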
>
>>> No, not clear. A Bloom filter is to a degree error resilient/tolerant.
>>> If for some reason part of your bloom filter data is damaged (n/w IO,
>>> bad memory, etc.) you can erase that section (set all the bits high);
>>> doing so will increase the fpp, depending on how large the section
>>> is. For some uses that is still quite acceptable, as the fpp has
>>> already been determined to be orders of magnitude less than what can
>>> be tolerated.
>>>
>>
>> If you want to go down this road, then you should think about
>> implementing your own allocator with error correction support and use
>> that in the bloom_filter or any container that supports/uses custom
>> allocators. Maybe you'll want an STL vector that also knows how to
>> determine whether part of the data is corrupted?
>
> Please read the paper that I mentioned in one of the earlier posts.
> There are also uses for the intentional introduction of errors into
> the BF.
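
To spell out the recovery trick from my earlier mail: because a set bit
can never cause a false negative, a corrupted region can simply be
forced high. A sketch over a raw bit vector (hypothetical helper, not
part of any proposed interface):

  #include <vector>
  #include <cstddef>

  // Mark the damaged region [first, last) as "all set". Membership
  // queries stay correct for every inserted element; only the false
  // positive probability rises, in proportion to the region's size.
  void erase_damaged_region(std::vector<bool>& bits,
                            std::size_t first, std::size_t last)
  {
     for (std::size_t i = first; i < last && i < bits.size(); ++i)
        bits[i] = true;
  }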
>
>>> This is NOT data compression. If you are providing union operations,
>>> then you should also provide self-compression, that is, for example,
>>> OR'ing the 1st half of the filter with the 2nd half, decreasing its
>>> size.
>>>
>>
>> I don't understand why you want to do this when clearly the Bloom
>> Filter is meant to have a fixed number of bits. Any variation to this
>> in transit is beyond the scope of the implementation of the Bloom
>> Filter.
>
> Again, there is some motivation and research behind compression. The
> tradeoff between size, lookup time and FPR is one of the reasons.
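
To spell out the halving I described: it works because of a simple
modular identity, (h mod 2q) mod q == h mod q, so every bit that was
set for an element is still set after folding and only the fpp grows.
A sketch (hypothetical helper, assuming an even-sized bit vector):

  #include <vector>
  #include <cstddef>
  #include <cassert>

  std::vector<bool> fold_in_half(const std::vector<bool>& bits)
  {
     assert(bits.size() % 2 == 0);
     const std::size_t q = bits.size() / 2;
     std::vector<bool> folded(q);
     for (std::size_t i = 0; i < q; ++i)
        folded[i] = bits[i] || bits[i + q]; // OR 2nd half onto 1st
     return folded;
  }

  // After folding, lookups must index with hash(value) % q rather
  // than % (2 * q).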
>
>
>> Let me rehash a bit:
>> 4. The implementation is meant to be simple, portable across
>> platforms, and C++ standard-compliant. When you deal with these
>> issues, it's best you don't think about cache hit-miss and coherency
>> issues IMO. At any rate if you want to have better control of the
>> memory allocation and issues like these, you can control them by: a)
>> using your favorite STL implementation of an std::bitset in the case
>> of the statically-sized filter or b) by providing your own allocator
>> and using the "correct" block type in the case of a
>> runtime-constructed Bloom Filter.
>
> And the implementation can be perfectly C++ compliant and portable
> while including variants of the Bloom filter that pay attention to the
> cache hit-miss ratio and coherency. And it is not only an issue of
> memory allocation itself; it has much to do with the internal workings
> of the filter.
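
For what it's worth, the cache-conscious variant Milosz is referring to
is usually a blocked filter: the first hash selects one cache-line
sized block and all k probes stay inside it, so a query touches a
single cache line. A rough sketch (illustrative only; a real
implementation would use an aligned array of words rather than
std::vector<bool>):

  #include <vector>
  #include <cstddef>
  #include <boost/functional/hash.hpp>

  template <typename T>
  class blocked_bloom_filter
  {
     static const std::size_t block_bits = 512; // one 64-byte line

  public:
     blocked_bloom_filter(std::size_t blocks, std::size_t k)
     : bits_(blocks * block_bits, false), blocks_(blocks), k_(k) {}

     void insert(const T& value)
     {
        const std::size_t base = block_of(value) * block_bits;
        for (std::size_t i = 1; i <= k_; ++i)
           bits_[base + probe(value, i)] = true;
     }

     bool contains(const T& value) const
     {
        const std::size_t base = block_of(value) * block_bits;
        for (std::size_t i = 1; i <= k_; ++i)
           if (!bits_[base + probe(value, i)])
              return false;
        return true;
     }

  private:
     std::size_t block_of(const T& value) const
     {
        std::size_t seed = 0;
        boost::hash_combine(seed, value);
        return seed % blocks_;
     }

     std::size_t probe(const T& value, std::size_t i) const
     {
        std::size_t seed = i;
        boost::hash_combine(seed, value);
        return seed % block_bits;
     }

     std::vector<bool> bits_;
     std::size_t blocks_, k_;
  };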
>
>> Of course, if the library under construction isn't suiting your needs,
>> there should be room for improvement. But there's no need to be too
>> fancy with the implementation/interface or too clever with the
>> False Positive Probability -- you can control this easily by
>> choosing/implementing your own hash functions and the size of the
>> underlying bitset.
>
> You can do more than choosing a custom hash and size.
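
Agreed. One way to open up more than the hash and the size is to make
the hash sequence itself a policy; a rough sketch (purely illustrative,
not a proposed interface):

  #include <vector>
  #include <cstddef>

  // HashSequence is any indexable collection of hash functors,
  // e.g. std::vector<std::size_t (*)(const T&)>.
  template <typename T, typename HashSequence>
  class basic_filter
  {
  public:
     basic_filter(std::size_t m, const HashSequence& hashes)
     : bits_(m, false), hashes_(hashes) {}

     void insert(const T& value)
     {
        for (std::size_t i = 0; i < hashes_.size(); ++i)
           bits_[hashes_[i](value) % bits_.size()] = true;
     }

     bool contains(const T& value) const
     {
        for (std::size_t i = 0; i < hashes_.size(); ++i)
           if (!bits_[hashes_[i](value) % bits_.size()])
              return false;
        return true;
     }

  private:
     std::vector<bool> bits_;
     HashSequence hashes_;
  };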
>
> Regards,
> Milosz

