Boost logo

Boost :

Subject: Re: [boost] [Review:Algorithms] Comments about search algorithms
From: Marshall Clow (mclow.lists_at_[hidden])
Date: 2011-09-28 17:03:08


On Sep 28, 2011, at 1:45 PM, dpxguard-boost_at_[hidden] wrote:

>>> And a question for Marshall: I have noticed that inside the search.hpp file you have a requirement for random-access iterators regarding the Knuth-Pratt-Morris searching algorithm as well. Is this for both the pattern and the corpus?
>
>> Yes, and the here's the reason for that.
>
>> There are basically two steps to the KMP algorithm (and frankly, the other two as well), and those two steps are coded in the do_search method.
>
>> Given a potential match point (i.e position in the pattern and the corpus):
>> 1) test and advance until you find a mismatch or come to the end (and report success)
>> 2) Find the next potential match point.
>
>> Step #1 can be done with ForwardIterators (or Bidirectional, in the case of Boyer-Moore), but in the second step, you have to be able to advance to arbitrary location in both the pattern and the corpus:
>
>> // Figure out where to start searching again
>> 544: match_start += idx - skip_ [ idx ];
>> 545: idx = skip_ [ idx ] >= 0 ? skip_ [ idx ] : 0;
>
>> In this case, idx is the position in the pattern.
>
>> Hm.. if the skip table stored iterators instead of indexes, you could just use ForwardIterators.....
>
> Using iterators instead of indexes is not a bad idea, but even with indexes, what is the reason for needing random-access iterators for the corpus? Typically in good-suffix shift we are shifting the pattern and not the corpus! I will try to study your code asap in order to become more specific.

Unless I'm sadly mistaken (which has been known to happen), we are shifting both in the pattern and the corpus.
In that code, "match_start" is the place where the match might be, and "idx" is where we want to start comparing.
If idx == 0, then we start comparing from the beginning of the pattern, but otherwise, from somewhere in the middle.

> I ensure you that requiring no more than forward iterators for both the pattern and the corpus is both feasible and important. And it is particularly important, since in the random-access iterators case BM will almost always be faster, hence the usefulness of a KMP implementation, which requires random-access iterators, is rather limited.

Understood.

-- Marshall

Marshall Clow Idio Software <mailto:mclow.lists_at_[hidden]>

A.D. 1517: Martin Luther nails his 95 Theses to the church door and is promptly moderated down to (-1, Flamebait).
        -- Yu Suzuki


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk