Boost logo

Boost :

Subject: Re: [boost] [Review:Algorithms] Comments about search algorithms
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2011-09-28 09:36:20


dpxguard-boost_at_[hidden] wrote:
>> Phil Endecott wrote:
>> > - A user faced with these three (or four) algorithms will need to?
>> > choose one. ?The docs should offer some guidance, e.g. benchmarks or?
>> > rules-of-thumb, about which to use.
>
>> In particular, I've not yet found a case where KMP is the best choice. ?
>> BM seems better in all the cases that I've tried and has the same or?
>> better complexity bounds. ?(Well, it might perform slightly better when?
>> the pattern length is 1 but in that case, std::find could be used.) ?
>> Does anyone know of any cases where KMP is likely to be better than?
>> BM? ?(If we can't find any, is it worth including it?)
>
> I will try to give a complete answer...
>
> The most notable (and common) techniques for improving the sequence
> searching performance are the bad-character shift (occurrence shift)
> and the good-suffix shift (matching shift). The Boyer-Moore algorithm
> implements both these techniques, while the Boyer-Moore-Horspool
> only implements the bad-character shift and the KMP is merely based
> on the good-suffix shift. (Although the Boost.algorithm documentation
> only states that Horspool "is a refinement of the Boyer-Moore algorithm
> that trades space for time", this is by no means the most notable
> difference between these two algorithms.)
>
> In general, bad-character shift performs very good when the
> search-pattern is relatively small and the alphabet relatively large.
> On the other hand, good-suffix shift helps more when the pattern is
> relatively large and the alphabet small. In the most common cases of
> text searching, the patterns are usually small and the alphabets large,
> (> 255) hence the KMP (which is only using good-suffix shift) will
> almost never outperform BM in usual text-searching situations. (Note
> that KMP has not been used in the original std::search implementation
> of the STL, because even the straightforward (brute-force) algorithm
> has been considered to be superior in the average case!)?
>
> The only practical case that I could think of, where the KMP could
> possibly outperform the BM are the DNA searches, where the patterns
> can be very long and the alphabet has only four characters. And even
> then, I would recommend careful performance testing, in order to
> ensure that KMP can be significantly faster than BM.

Thanks Jim. Marshall, could you distil something from that for the docs?

How efficiently could the current algorithms be used with DNA data?
Does anyone have any experience with a packed base-4 representation?

> However, in my opinion the most interesting and useful property of
> the KMP algorithm is not its performance, but its ability to help in
> building very generic sequence searching algorithms. In fact KMP has
> helped me to build an efficient generic sequence searching algorithm,
> which actually works with ranges accessed by single-pass (input)
> iterators. (See http://www.codeproject.com/KB/stl/single_pass_search.aspx
> and also in boost-vault/Algorithms) Needless to say that I would be
> more than happy to submit this work to boost, whenever there is an
> interest.

I am reminded that Spirit has some sort of adaptor that allows an input
iterator to be used with a grammar that requires backtracking by
storing recent inputs in a buffer. Presumably something like this
could be used here. How much does your algorithm gain over a
combination of such a buffer and one of Marshall's algorithms?

Regards, Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk