From: Peter Turcan (peterturcan_at_[hidden])
Date: 2025-05-19 20:36:15
Joaquin - thanks for your comments on my comments, here are my comments on
your comments on my comments:
>In which ways do you think that this primer does not fulfill the goals of
your proposed "Introduction" section?
My issue with your Primer is that it explains what a Bloom filter is, not
what it does - start the *Introduction* by answering the question "what is
the problem that a Bloom Filter solves?". Don't move on to the "is" until
you have answered "why should I be interested in this?".
A Bloom filter *is* a probabilistic data structure where inserted elements
can be looked up with 100% accuracy, whereas looking up for a non-inserted
element may fail with some probability called the filter's *false positive
rate* or FPR. The tradeoff here is that Bloom filters occupy much less
space than traditional non-probabilistic containers (typically, around 8-20
bits per element) for an acceptably low FPR. The greater the filter's
*capacity* (its size in bits), the lower the resulting FPR.
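
To make the quoted definition concrete, here is a minimal sketch of the idea,
assuming a toy filter built on std::hash with double hashing. The class
ToyBloomFilter and its members are hypothetical names chosen for illustration;
this is not the API or the hashing scheme of the library under review.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter for illustration only (hypothetical, not the reviewed library).
class ToyBloomFilter {
public:
    // m = capacity in bits, k = bit positions set/checked per element.
    ToyBloomFilter(std::size_t m, std::size_t k) : bits_(m, false), k_(k) {
        assert(m > 0 && k > 0);
    }

    // Insertion sets k pseudorandomly chosen bits to 1.
    void insert(const std::string& s) {
        for (std::size_t i = 0; i < k_; ++i) bits_[position(s, i)] = true;
    }

    // false -> the element was definitely never inserted.
    // true  -> the element was possibly inserted (may be a false positive).
    bool may_contain(const std::string& s) const {
        for (std::size_t i = 0; i < k_; ++i)
            if (!bits_[position(s, i)]) return false;
        return true;
    }

private:
    // Assumed scheme: double hashing, position_i = (h1 + i * h2) mod m.
    // The k positions are not necessarily distinct: two of them may coincide.
    std::size_t position(const std::string& s, std::size_t i) const {
        const std::size_t h1 = std::hash<std::string>{}(s);
        const std::size_t h2 = std::hash<std::string>{}(s + "#salt") | 1;
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t k_;
};
```

After insert("alice"), may_contain("alice") always returns true. For a string
that was never inserted, may_contain usually returns false, but it can return
true when all of that string's positions happen to have been set by other
insertions; that is the false positive the quoted text refers to.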
My 2c is that an *Introduction* section should describe the purpose and use
cases of a library in a way that is understandable to a program manager or
analyst who is researching libraries to consider for their project but who
will not be as technical as their developers. It is people who need assistance
that turn to overview documentation - typically experts in the field go
straight to the *Reference*.
If fraud detection is not the main use of Bloom Filters, what is? Be bold
and articulate use cases. Every use-case you mention will pull in users.
> Have you written this proposed introduction based on your reading of my
docs or from some other sources?
Other sources, for the reason described above.
> I can try to organize the tutorial as part of a narrative where a full
example is built. I have some problems though with that, as some parts,
notably the
one where I describe the different configuration options, don't look to me
amenable to that approach.
Configuration options belong in the *Getting Started/Installation*
sections; a *Tutorial* should proceed step by step. Consider deciding on the
appropriate configuration for your tutorial, and run with that. The purpose
of a first tutorial is to provide the user with a "feel good" experience
that encourages them to go further - it need not be complete in any sense.
> Do you have any favorite Boost lib referencewise?
Take a look at this ref entry from Boost.Geometry - I don't have to leave
this page to understand what it does, and how to use it:
Users of your library will jump straight into *Reference* pages, and
straight back out again. Best to fully describe the purpose and
functionality of each construct without requiring the reader to leave that
page. Repetition within a *Reference* is not much of an issue, as reference
pages are not read front to back like a book.
Perhaps identify the reference entries that do the heavy-lifting, and
consider expanding them so that they are self-contained explanations with
examples and diagrams if needed.
Hope this is useful and good luck with this effort!
- Peter Turcan
On Sat, May 17, 2025 at 2:26 AM Joaquin M López Muñoz via Boost <
boost_at_[hidden]> wrote:
> On 17/05/2025 at 0:42, Peter Turcan via Boost wrote:
> > Joaquín, Arnaud,
> >
> > Thanks for submitting and managing the review of this library. Here is
> my
> > review of the documentation. Good to see plenty of links, tables and
> > structure to the reference, diagrams and equations, and performance
> > benchmarks.
> Hi Peter, thank you for your documentation review!
> > I think the doc could use a much friendlier introduction that explains
> the
> > purpose of these useful filters to all C++ developers.
> In the current documentation, this is the intended purpose of the "Primer"
> section. In my mind, there are two big groups in the audience:
>
> * People who already know what a Bloom filter is and want to go straight
> into what this particular library offers.
> * People who don't know about Bloom filters.
>
> So, the primer is provided as a sort of skippable section depending on the
> previous background of the reader. In which ways do you think that this
> primer does not fulfill the goals of your proposed "Introduction" section?
>
> > I suggest something like this:
> >
> > *Introduction*
> > [...]
> >
> > Now, an attempt is made to access the financial website that is protected
> > by a pattern-matching Bloom filter. In our example, the Bloom filter
> could
> > simply be:
> >
> > If (request-name contains "e") then return positive, else return
> negative;
> This is not how Bloom filters work. A Bloom filter does not process
> elements
> based on (implicit or explicit) patterns of the input data. Instead, for
> each
> element inserted into the filter, the values of the associated hash
> functions
> as applied to the element are used to calculate a number of bit positions
> that are then set to 1. The statistical magic is that when querying for an
> element that has _not_ been inserted, the probability that all the
> associated
> bits are set to 1 is very small --though not zero, hence the false
> positive rate
> thing.
>
> If the primer has led you to this exposition of Bloom filters, then I guess
> it's not doing an excellent job at explaining the data structure :/ Have
> you written
> this proposed introduction based on your reading of my docs or from some
> other sources?
>
> > [...]
> >
> > Whereas fraud detection is the top use of Bloom filters, [...]
> I wouldn't say fraud detection is _the top use_ of Bloom filters, though
> it's certainly a reasonable application of this data structure.
> > they have proved
> > useful in such situations as checking genome strings against a database
> of
> > DNA sequences. These checks can be time intensive due to the length of
> DNA
> > strings[...]
>
> FWIW, there's a genomics-related example at
>
> https://github.com/joaquintides/bloom/blob/develop/example/genome.cpp
>
> > [...]
> >
> > Follow the intro with a *Getting Started* section:
> >
> > *Getting Started*
> > - describe the *Requirements*, *Installation* and a trivial "hello world"
> > example of the use of the library.
> Yes, I can plug this info as a subsection of the introduction. Note that
> installation
> is basically non-existent, as the docs assume that the library is
> already part of
> Boost, and, being header-only, there's no additional build step.
> > *Tutorial*
> > - this section is "tutorial" in name only - the section should be a
> > numbered step-by-step procedure for the reader to follow - that results
> in
> > a small but meaningful example working, with the expected output.
> Currently
> > it is not certain what to do.
> I can try to organize the tutorial as part of a narrative where a full
> example
> is built. I have some problems though with that, as some parts, notably the
> one where I describe the different configuration options, don't look to me
> amenable to that approach.
> > [...]
> >
> > *Reference*
> > - I found it difficult to follow the Reference in the sense of "what am I
> > looking at?". Best to be specific as to what is a class, template,
> method,
> > property, field, macro etc. AND critically explain the use-case of each
> of
> > these constructs in the intro or description of the construct - what is
> the
> > scenario that this construct applies to?
> The reference is built in standardese format, so I guess it looks pretty
> straightforward to me cause I've wasted the best years of my youth reading
> the standard. Do you have any favorite Boost lib referencewise?
> > Some of the reference I didn't really understand - such as:
> >
> > A compile-time std::size_t value indicating the number of (not
> necessarily
> > distinct) bits set/checked per operation.
> >
> > - what does *not necessarily distinct* mean?
> The way a Bloom filter works, element insertion translates to a given
> number of bits being set to 1. But, as their positions are selected
> pseudorandomly and independently, it may be the case that some
> of the positions coincide.
> > *Appendix A*
> > The equations are somewhat daunting and difficult to determine what is
> good
> > practice and what is not. Perhaps add text to describe best practices in
> > certain situations.
> Sorry, I'm not getting you here. How are mathematical equations related
> to best practices? In any case, the appendix is provided precisely as an
> appendix because it's not essential to understanding or using the library
> --only the mathematically inclined will be looking at this.
> > Perhaps add an *Acknowledgements* section at the end - authors, testers,
> > motivation, etc.
> Absolutely, on my todo list.
> > *Summary*
> > I would most like to see a friendly introduction, a step-by-step
> tutorial,
> > and a clearer reference.
> >
> > The case for Bloom filters seems strong, and good luck with the library.
>
> Again, thanks for your review and useful tips.
>
> Joaquin M Lopez Munoz