Subject: Re: [boost] Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2009-06-19 03:56:47


On Fri, Jun 19, 2009 at 2:59 PM, Milosz Marian HULBOJ <mhulboj_at_[hidden]> wrote:
> Dean Michael Berris wrote:
>>
>> When you say correct set of interfaces, what do you need more than
>> 'insert(...)' and 'contains(...)' for the basic bloom_filter? In the
>> case of a counting bloom filter, maybe you'd have 'remove(...)' or
>> 'erase(...)'. Other than these functions, I don't know what you mean by
>> the most correct set of interfaces.
>
> For me it means the possibility of defining different variants of Bloom
> Filters (standard or extensions) in an easy way. Something more than
> text-book.
>

Right, then I guess the Concepts documentation should be enough --
saying that a type models the BloomFilter concept when it supports
the following restrictions/interface: ...
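
A sketch (the names are assumed, not necessarily the sandbox
library's actual interface) of the expressions such a type would need
to support; instantiating this template is effectively the concept
check:

  template <class Filter, class T>
  void require_bloom_filter(Filter& bf, const T& x) {
      bf.insert(x);                // adds x, may set several bits
      bool maybe = bf.contains(x); // true may be a false positive,
      (void)maybe;                 // but false is never wrong
  }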

>>> There are various (many) use-cases for a BF
>>>
>>> 1. user provides max size and max fpp; BF determines best k
>>> 2. user provides max fpp and max expected insertions; BF determines best k, m
>>> 3. user provides max fpp and fuzzy k with limit; BF determines m, k and
>>> optimal fpp
>>> 4. etc., etc.
>>>
>>> Some of these can change/vary during runtime, so getters for fpp and
>>> inserted count would be nice/good/expected.
>>>
>>
>> This is too complex -- and has too many assumptions. First, it assumes
>> that the user wants a self-(re)sizing bloom filter. Second, it assumes
>> that the user doesn't want to provide their own hash functions. Third,
>> it assumes that the user actually cares whether the bloom filter is
>> smarter than the user.
>
> 1) The mentioned assumptions don't imply a dynamic self-resizing filter.
> But it would be nice to look at the recent papers about dynamic
> self-resizing filters...

Actually, it does -- it implies that you just give a false positive
probability and it computes the number of hash functions (k) and the
size of the container (m). Maybe not re-sizing (I've read a few papers
that suggest the possibility and practicability of that), but even
self-sizing at construction is IMO too fancy. Maybe a convenience
function and meta-function would work:

  // (a double can't be a template parameter in C++, so the FPP would
  // be spelled as a ratio; the expected element count N is also
  // needed to derive m)
  template <std::size_t N, std::size_t FppNum, std::size_t FppDen>
  struct m_and_k_computer;

  m_and_k_computer<1000, 9, 100>::k; // yields the number of hash functions
  m_and_k_computer<1000, 9, 100>::m; // yields the number of bits used to represent
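
For reference, here are the textbook formulas such a meta-function
would encode, as a runtime sketch (the helper names are mine, not
part of the proposed library):

  #include <cmath>
  #include <cstddef>

  // m = -n*ln(p) / (ln 2)^2 bits, k = (m/n)*ln(2) hash functions
  struct m_and_k { std::size_t m, k; };

  inline m_and_k compute_m_and_k(std::size_t n, double p) {
      const double ln2 = std::log(2.0);
      m_and_k r;
      r.m = static_cast<std::size_t>(
          std::ceil(-static_cast<double>(n) * std::log(p) / (ln2 * ln2)));
      r.k = static_cast<std::size_t>(
          std::ceil(static_cast<double>(r.m) / n * ln2));
      return r;
  }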

> 2) Why does it assume that the user doesn't want custom hash functions?

Because none of the items in the numbered list presented earlier
mentioned the possibility of a custom hash function. ;-)

> 3) It's not about being smarter, but about providing in convenient form some
> of the theory/math behind the filters.
>

Sure, but it's completely possible to layer these things on top of the
underlying data structure, which is the *_bloom_filter. I mean,
computing the false positive probability is independent of the actual
workings of a bloom filter. I can imagine a layer of convenience
functions (factories, really) that do everything above and use the
current bloom filter implementation as the resulting type.
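
A hypothetical factory along those lines (the bloom_filter
constructor taking a bit count is an assumption on my part):

  // layered over the basic runtime-sized filter; reuses the
  // compute_m_and_k sketch above
  template <class T>
  bloom_filter<T> make_bloom_filter(std::size_t expected_n, double max_fpp) {
      const m_and_k p = compute_m_and_k(expected_n, max_fpp);
      return bloom_filter<T>(p.m); // size the underlying bit vector
  }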

What I'm really saying is that these things (ways to compute k and m,
having different hash functions, etc.) are nice to have, but are they
really part of the basic notion of what a Bloom Filter is or how it
actually behaves? There is an algorithmically simple template for the
'insert' and 'contains' operations, and sure, there can be more
optimizations at that level -- suggestions are definitely welcome.
However, the question now becomes: should these things be part of a
Bloom Filter library? Maybe yes, maybe no -- maybe, like the type of
elements you represent and your actual requirements in the
application, it should be determined by the users.

>> In some cases (like the simplest cases I'm trying to solve for the
>> moment) the current implementation seems sufficient. Of course, it's
>> not fancy, and I don't intend for it to become too fancy. If you think
>> you can implement something better, then please by all means show me
>> code that I can deal with -- this is not said sarcastically, I really
>> do want to see what you mean when you say all the above translated
>> into code.
>
> And I am implying that the simplest and naïve case is far from
> real-world applications. One should not focus on the simplest case, but
> attempt to provide a good solution.
>

Hmmm... I'm providing a simple means of defining a bloom filter given
the specifications. Now if you need a certain FPP met, then maybe you
should be the one computing K and M -- if you want to use special hash
functions, then maybe you should provide them instead of using the
defaults.

The defaults should make it convenient to use, but of course if the
genericness of the simplest case doesn't work for you, then you have
the option of specializing to your own requirements.
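
For illustration, plugging in your own hashers might look like this
(how the filter accepts them, e.g. as extra template parameters, is
an assumption, not the actual interface):

  #include <cstddef>

  // two toy hash functors; real ones should be independent and well-mixed
  struct knuth_hash {
      std::size_t operator()(int x) const {
          return static_cast<std::size_t>(x) * 2654435761UL;
      }
  };
  struct shift_xor_hash {
      std::size_t operator()(int x) const {
          std::size_t v = static_cast<std::size_t>(x);
          return v ^ (v >> 16);
      }
  };

  // hypothetical: static_bloom_filter<int, 1024, knuth_hash, shift_xor_hash> bf;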

>
>> Anyway, the point is to provide a bloom filter using the textbook
>> definition. If there's a need for other kinds of bloom filters, like
>> counting bloom filters, then that's another implementation that can be
>> bundled in the library later -- *if* there is enough need for it, and
>> *if* the library even gets into Boost just trying to solve the common
>> (basic) cases. Of course the library is designed to evolve.
>
> Yet again - the simplest case. If the library is designed to evolve, one
> should take much care when designing the interfaces. It's not easy to change
> them afterwards. Providing counting modes or a full-fledged counting filter
> should not be a problem.
>

I'm a little confused here: what else do you need from a bloom filter
other than 'bf.insert(x)' and 'bf.contains(x)'? This is the basic
interface to any bloom filter IIRC, and I don't get why sticking to
that will necessarily make it harder to change in the future.

If you're talking about the type interface, it's going to be way
harder to change if you have a single bloom_filter type with loads
and loads of template parameters that define the body of the
bloom_filter, IMO. This is why I chose to separate the
static_bloom_filter (that is, the statically-sized bloom filter) and
the runtime-sized bloom_filter templates. If a special kind of
bloom_filter needs to be implemented in the future (like a counting
bloom filter or a compressed bloom filter), then that's a different
type with an interface similar to the basic bloom_filter's.
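
In code, the split would look roughly like this (the exact signatures
are assumptions based on the names in this thread):

  static_bloom_filter<int, 1024> sbf; // 1024 bits, fixed at compile time
  sbf.insert(42);
  bool maybe = sbf.contains(42);      // true, possibly a false positive

  bloom_filter<int> dbf(1024);        // bit count chosen at construction
  dbf.insert(42);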

>>> No, not clear. A bloom filter is, to a degree, error resilient/tolerant.
>>> If for some reason part of your bloom filter data is damaged (n/w IO, bad
>>> memory, etc.) you can erase that section (set all its bits high); doing
>>> so will increase the fpp, depending on how large the section is. For some
>>> uses that is still quite acceptable, as the fpp has already been
>>> determined to be orders of magnitude less than what can be tolerated.
>>>
>>
>> If you want to go down this road, then think about implementing
>> your own allocator with error-correction support and using that in the
>> bloom_filter or any container that supports/uses custom allocators.
>> Maybe you'll want an STL vector that also knows how to determine whether
>> part of its data is corrupted, too?
>
> Please read the paper that I have mentioned in one of the earlier posts.
> There are also uses of the intentional introduction of the errors into the
> BF.
>

Hmmm... What I'm implying here is that the "generic" data structure
should not be bothered with details like whether the memory gets
corrupted, etc. IMO you deal with network transmission errors at the
network transmission layer, memory errors at the memory level, and,
in general, each class of error at the appropriate level of detail.
Now if we're going to start writing self-aware containers, then maybe
there should be containers that can actually heal themselves in case
of memory corruption. Those would be nice to have, but I doubt they'd
be much better than the containers we already have at the moment,
which *aren't* storage/network/self-aware except in these special
cases where memory gets corrupted.

>>> This is NOT data compression. If you are providing union operations, then
>>> you should also provide self-compression: that is, for example, OR'ing the
>>> 1st half of the filter with the 2nd half, decreasing its size.
>>>
>>
>> I don't understand why you want to do this when clearly the Bloom
>> Filter is meant to have a fixed number of bits. Any variation to this
>> in transit is beyond the scope of the implementation of the Bloom
>> Filter.
>
> Again, there is some motivation and research behind compression. The
> tradeoff between size, lookup time and FPR is one of the reasons.
>

Yes, I understand this. Now if you need/want a compressed bloom
filter, maybe a compressed_bloom_filter type would be more
appropriate. No?
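
To make the quoted folding idea concrete, a minimal sketch
(std::vector<bool> stands in for the filter's bit array):

  #include <cstddef>
  #include <vector>

  // OR the second half onto the first, halving the size; lookups on
  // the folded filter must then take bit positions modulo the new,
  // smaller size, and the FPP rises accordingly.
  std::vector<bool> fold_in_half(const std::vector<bool>& bits) {
      const std::size_t half = bits.size() / 2; // assumes an even size
      std::vector<bool> folded(half);
      for (std::size_t i = 0; i < half; ++i)
          folded[i] = bits[i] || bits[half + i];
      return folded;
  }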

>
>> Let me rehash a bit:
>> 4. The implementation is meant to be simple, portable across
>> platforms, and C++ standard-compliant. When you deal with these
>> issues, it's best you don't think about cache hit-miss and coherency
>> issues IMO. At any rate if you want to have better control of the
>> memory allocation and issues like these, you can control them by a)
>> using your favorite STL implementation of std::bitset in the case
>> of the statically-sized filter, or b) providing your own allocator
>> and using the "correct" block type in the case of a
>> runtime-constructed Bloom Filter.
>
> And the implementation can be perfectly C++ compliant and portable with
> including the variants of the Bloom filter that pay attention to the cache
> hit-miss ratio and coherency. And it is not only the issue of memory
> allocation itself but it has much to do with the internal working of the
> filter.
>

Yes, I agree. Now what do these special bloom filters look like? How
do they behave? Can I possibly implement all these different kinds of
special bloom filters? Maybe yes, maybe no, but what I'm saying is
that for the most part the goal is to make Bloom Filters simple and
easy to use. Take the STL vector: some people have chosen not to use
it for a myriad of reasons just like the ones you mention (cache
coherency, etc.), but for the simplest cases where you need a
homogeneous container, I believe the STL vector is enough. The same
reasoning applies here: if you need a simple bloom filter, then the
candidate Boost.Bloom_filter might work for you. If not, maybe a
special one in the library would work better for you. Or you just
might have to write one for your specific needs.
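
To make point (b) from the quoted list concrete: if the runtime-sized
filter wraps boost::dynamic_bitset (an assumption on my part), the
block type and allocator are ordinary template parameters:

  #include <boost/dynamic_bitset.hpp>
  #include <memory>

  typedef unsigned long block_type;              // choose the block width
  typedef std::allocator<block_type> alloc_type; // or a custom allocator
  boost::dynamic_bitset<block_type, alloc_type> bits(8192); // m = 8192 bits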

>> Of course, if the library under construction isn't suiting your needs,
>> there should be room for improvement. But there's no need to be too
>> fancy with the implementation/interface or too clever with the
>> False Positive Probability -- you can control this easily by
>> choosing/implementing your own hash functions and the size of the
>> underlying bitset.
>
> You can do more than choosing a custom hash and size.
>

Yes, I know this -- but the point I was trying to make was that, _for
the simplest cases_ and for the most part, the simplicity of the
proposed solution should do.

HTH

-- 
Dean Michael Berris | Software Engineer, Friendster, Inc.
blog.cplusplus-soup.com | twitter.com/mikhailberis |
linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis |
deanberris.com
