Boost logo

Boost :

Subject: Re: [boost] [GSoC] Some new ideas for this year's GSoC
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2010-04-08 12:24:13


Thomas Klimpel wrote:
> Phil Endecott wrote:
>> I think that having a test data set in mind in advance would make this
>> much more concrete to reason about. Does anyone have any suggestions?
>
> Some people regard using an algorithm with a really bad complexity as
> some sort of bug, no?

Yes, indeed!

> So the question would be how this "bug" can be exposed, in case it
> is really a bug. One suggestion in this direction might be a denial
> of service attack. But in my own experience, such "bugs" tend to
> cause damage in more subtle and unexpected ways. Perhaps somebody
> can answer me the opposite question: What is the advantage of
> having subtle bugs in your algorithms that only wait to bite you
> when you expect it the least?

Well the not-100%-serious answer would be "because they might be faster
the rest of the time".

I think these algorithms would be useful contributions to Boost. But
because their complexity benefits may not be seen for common usage
patterns (e.g. short search patterns), I think that studying their
actual performance with some realistic test data (and not synthetic
pseudo-random patterns!) should be a part of the project.

I also wonder about other variations, e.g. avoiding recomputing the
table in the KMP or Boyer-Moore algorithms for repeated searches with
the same pattern:

string text = content_of_file("long_text.txt");
kmp_pattern pat("search for this"); // compute the table here.
string::const_iterator i = text.begin();
while (i != text.end()) {
   i = kmp_search(i,text.end(),pat);
   ...
}

If N = length of text, M = number of matches and P = length of pattern,
I think this reduces the complexity from O(N+MP) to O(N+P).

Do this with a TMP and a compile-time pattern, and you've got something expressive-like.

But in my experience, when I have had to do fast search it has always
been repeated search for different patterns and so I've used some sort
of an index. Stephan mentioned suffix arrays and I think that would be
an even more valuable contribution.

Regards, Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk