Boost logo

Boost :

Subject: Re: [boost] [GSoC] Some new ideas for this year's GSoC
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2010-04-06 16:10:10


Stefan wrote:
> * Optimize Boost String Algorithm finders to use an efficient substring
> matching algorithm as opposed to the naive algorithm. Most likely
> because of the fact that boost has implemented the finders for strings
> as generic algorithms, they haven't used efficient algorithms such as
> KMP, Boyer-Moore, Rabin-Karp etc.
> \see boost/algorithm/string/detail/finder.hpp
> The idea is to implement efficient algorithms such as the ones
> mentioned above to replace the current naive implementation. Note: it
> may be difficult to find nth matching substring position using some of
> these algorithms.
> Difficulty: easy-medium
>
> * Implement a string algorithm for generating the suffix array. The
> suffix array, after being precomputed on a certain string, allows for
> very efficient substring searches in the future. This is useful when the
> string doesn't change too often but there are plenty of substring to
> search against.

Hi Stefan,

Have you seen this thread, from a couple of years ago? :

     http://thread.gmane.org/gmane.comp.lib.boost.devel/171098

I don't know what has changed in string_algo since then, but I think
there is probably plenty of opportunity for easy optimisation.

Do you have a motivating application for this? A lot of this stuff is
rather data-dependant, so having a particular application that can
provide a benchmark is a good idea. I guess that a suffix array
library would be reasonable for a GSoC project, if you're good! You
might also like to consider variants that index at e.g. word
boundaries; I have used something like that and it works well for some
kinds of search.

Regards, Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk