Boost logo

Boost :

Subject: Re: [boost] [GSoC] Some new ideas for this year's GSoC
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2010-04-07 19:13:44


Stefan wrote:
>>> * Optimize Boost String Algorithm finders to use an efficient
>>> substring matching algorithm as opposed to the naive algorithm. Most
>>> likely because of the fact that boost has implemented the finders for
>>> strings as generic algorithms, they haven't used efficient algorithms
>>> such as KMP, Boyer-Moore, Rabin-Karp etc.
>>> \see boost/algorithm/string/detail/finder.hpp
>>> The idea is to implement efficient algorithms such as the ones
>>> mentioned above to replace the current naive implementation. Note: it
>>> may be difficult to find nth matching substring position using some of
>>> these algorithms.

> I don't have an application proving it's inefficient yet, but a O(nm)
> algorithm would almost always perform worse than a O(n+m) one.

So n = size of input and m = size of search pattern.

If m is typically small, e.g. 2, then O(n+m) could easily beat O(nm) if
the constant factors are in its favour.

Even if m is large, then there will often be a dominant case where the
potential match can be rejected after looking at just the first element
(i.e. if I'm looking for xyz in abcdefgxyzlmnop, then a "worse case
O(nm)" algorithm will actually take O(n)).

I think that having a test data set in mind in advance would make this
much more concrete to reason about. Does anyone have any suggestions?

> I've just looked at the posted thread, but the algorithm is_any_of only
> looks for a character in a string, not for a substring in a string.

I guess my point was that even the simpler problem of finding a single
character has lots of potential for interesting work.

Regards, Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk