Subject: Re: [boost] [Review:Algorithms] Comments about search algorithms
Date: 2011-09-27 18:56:41
> Phil Endecott wrote:
> > - A user faced with these three (or four) algorithms will need to
> > choose one. The docs should offer some guidance, e.g. benchmarks or
> > rules-of-thumb, about which to use.
> In particular, I've not yet found a case where KMP is the best choice.
> BM seems better in all the cases that I've tried and has the same or
> better complexity bounds. (Well, it might perform slightly better when
> the pattern length is 1 but in that case, std::find could be used.)
> Does anyone know of any cases where KMP is likely to be better than
> BM? (If we can't find any, is it worth including it?)
I will try to give a complete answer...
The most notable (and common) techniques for improving the sequence searching performance are the bad-character shift (occurrence shift) and the good-suffix shift (matching shift). The Boyer-Moore algorithm implements both these techniques, while the Boyer-Moore-Horspool only implements the bad-character shift and the KMP is merely based on the good-suffix shift. (Although the Boost.algorithm documentation only states that Horspool "is a refinement of the Boyer-Moore algorithm that trades space for time", this is by no means the most notable difference between these two algorithms.)
In general, bad-character shift performs very good when the search-pattern is relatively small and the alphabet relatively large. On the other hand, good-suffix shift helps more when the pattern is relatively large and the alphabet small. In the most common cases of text searching, the patterns are usually small and the alphabets large, (> 255) hence the KMP (which is only using good-suffix shift) will almost never outperform BM in usual text-searching situations. (Note that KMP has not been used in the original std::search implementation of the STL, because even the straightforward (brute-force) algorithm has been considered to be superior in the average case!)
The only practical case that I could think of, where the KMP could possibly outperform the BM are the DNA searches, where the patterns can be very long and the alphabet has only four characters. And even then, I would recommend careful performance testing, in order to ensure that KMP can be significantly faster than BM.
However, in my opinion the most interesting and useful property of the KMP algorithm is not its performance, but its ability to help in building very generic sequence searching algorithms. In fact KMP has helped me to build an efficient generic sequence searching algorithm, which actually works with ranges accessed by single-pass (input) iterators. (See http://www.codeproject.com/KB/stl/single_pass_search.aspx and also in boost-vault/Algorithms) Needless to say that I would be more than happy to submit this work to boost, whenever there is an interest.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk