Boost logo

Boost :

Subject: Re: [boost] [GSoC] Some new ideas for this year's GSoC
From: Stefan (mstefanro_at_[hidden])
Date: 2010-04-07 14:58:03


Phil Endecott wrote:
> Stefan wrote:
>> * Optimize Boost String Algorithm finders to use an efficient
>> substring matching algorithm as opposed to the naive algorithm. Most
>> likely because of the fact that boost has implemented the finders for
>> strings as generic algorithms, they haven't used efficient algorithms
>> such as KMP, Boyer-Moore, Rabin-Karp etc.
>> \see boost/algorithm/string/detail/finder.hpp
>> The idea is to implement efficient algorithms such as the ones
>> mentioned above to replace the current naive implementation. Note: it
>> may be difficult to find nth matching substring position using some of
>> these algorithms.
>> Difficulty: easy-medium
>>
>> * Implement a string algorithm for generating the suffix array. The
>> suffix array, after being precomputed on a certain string, allows for
>> very efficient substring searches in the future. This is useful when
>> the string doesn't change too often but there are plenty of substring
>> to search against.
>
> Hi Stefan,
>
> Have you seen this thread, from a couple of years ago? :
>
> http://thread.gmane.org/gmane.comp.lib.boost.devel/171098
>
> I don't know what has changed in string_algo since then, but I think
> there is probably plenty of opportunity for easy optimisation.
>
> Do you have a motivating application for this? A lot of this stuff is
> rather data-dependant, so having a particular application that can
> provide a benchmark is a good idea. I guess that a suffix array library
> would be reasonable for a GSoC project, if you're good! You might also
> like to consider variants that index at e.g. word boundaries; I have
> used something like that and it works well for some kinds of search.
>
>
> Regards, Phil.
>
>
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>
I don't have an application proving it's inefficient yet, but a O(nm)
algorithm would almost always perform worse than a O(n+m) one.
I've just looked at the posted thread, but the algorithm is_any_of only
looks for a character in a string, not for a substring in a string.
Let me know if you would like a benchmark comparing the
boost::algorithm::find_*() with implementations of KMP, Boyer-Moore or
other algorithms.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk