Boost logo

Boost :

Subject: Re: [boost] [GSoC] Some new ideas for this year's GSoC
From: Stefan (mstefanro_at_[hidden])
Date: 2010-04-08 19:16:41


Stefan wrote:
> Phil Endecott wrote:
>> Stefan wrote:
>>> * Optimize Boost String Algorithm finders to use an efficient
>>> substring matching algorithm as opposed to the naive algorithm. Most
>>> likely because of the fact that boost has implemented the finders for
>>> strings as generic algorithms, they haven't used efficient algorithms
>>> such as KMP, Boyer-Moore, Rabin-Karp etc.
>>> \see boost/algorithm/string/detail/finder.hpp
>>> The idea is to implement efficient algorithms such as the ones
>>> mentioned above to replace the current naive implementation. Note: it
>>> may be difficult to find nth matching substring position using some
>>> of these algorithms.
>>> Difficulty: easy-medium
>>>
>>> * Implement a string algorithm for generating the suffix array. The
>>> suffix array, after being precomputed on a certain string, allows for
>>> very efficient substring searches in the future. This is useful when
>>> the string doesn't change too often but there are plenty of substring
>>> to search against.
>>
>> Hi Stefan,
>>
>> Have you seen this thread, from a couple of years ago? :
>>
>> http://thread.gmane.org/gmane.comp.lib.boost.devel/171098
>>
>> I don't know what has changed in string_algo since then, but I think
>> there is probably plenty of opportunity for easy optimisation.
>>
>> Do you have a motivating application for this? A lot of this stuff is
>> rather data-dependant, so having a particular application that can
>> provide a benchmark is a good idea. I guess that a suffix array
>> library would be reasonable for a GSoC project, if you're good! You
>> might also like to consider variants that index at e.g. word
>> boundaries; I have used something like that and it works well for some
>> kinds of search.
>>
>>
>> Regards, Phil.
>>
>>
>>
>> _______________________________________________
>> Unsubscribe & other changes:
>> http://lists.boost.org/mailman/listinfo.cgi/boost
>>
> I've decided to implement Rabin-Karp today and after benchmarking I
> realized that on my pseudo-randomly generated data it performs worse
> than boost. My Rabin-Karp implementation will sure need some profiling
> before it can be used in benchmarking. Also some better data sets than
> what I came up with.
> Here it is:
>
> ...
> I doubt rabin-karp normally performs so badly, therefore my
> implementation definitely needs improvement.
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>

Sorry for double-posting, no idea why it happened. Also, one can neglect
the comments "//! @TODO: choose smaller primes (so that 2*prime fits in
uint32)" and "//! Heuristic specialization, omit testing character by
character." as they are incorrect. The former has been solved whereas
the latter is not a specialization, although was meant to be at one point


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk