Boost logo

Boost :

Subject: Re: [boost] [GSoC] Some new ideas for this year's GSoC
From: Simonson, Lucanus J (lucanus.j.simonson_at_[hidden])
Date: 2010-04-07 19:59:17


Phil Endecott wrote:
> Stefan wrote:
>>>> * Optimize Boost String Algorithm finders to use an efficient
>>>> substring matching algorithm as opposed to the naive algorithm.
>>>> Most likely because of the fact that boost has implemented the
>>>> finders for strings as generic algorithms, they haven't used
>>>> efficient algorithms such as KMP, Boyer-Moore, Rabin-Karp etc.
>>>> \see boost/algorithm/string/detail/finder.hpp
>>>> The idea is to implement efficient algorithms such as the ones
>>>> mentioned above to replace the current naive implementation. Note:
>>>> it may be difficult to find nth matching substring position using
>>>> some of these algorithms.
>
>> I don't have an application proving it's inefficient yet, but a O(nm)
>> algorithm would almost always perform worse than a O(n+m) one.
>
> So n = size of input and m = size of search pattern.
>
> If m is typically small, e.g. 2, then O(n+m) could easily beat O(nm)
> if
> the constant factors are in its favour.
>
> Even if m is large, then there will often be a dominant case where the
> potential match can be rejected after looking at just the first
> element (i.e. if I'm looking for xyz in abcdefgxyzlmnop, then a
> "worse case
> O(nm)" algorithm will actually take O(n)).
>
> I think that having a test data set in mind in advance would make this
> much more concrete to reason about. Does anyone have any suggestions?

The application where this becomes important is DNA sequencing. If you have a very small alphebet and large substring to match in a very large string you can get poor performance from the naïve algorithm *if* there is a lot of repetition of the beginning of your substring in the string you are searching. This can happen frequently when working with genetic data, but otherwise is of very little practical concern. I'm not sure boost string algorithms is the place genetic researchers would turn for their DNA crunching. There is no doubt a host of academic and commercial libraries and applications available for this purpose including multi-threaded and massively distributed search.

Regards,
Luke


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk