Boost :

Date view	Thread view	Subject view	Author view

Subject: Re: [boost] [GSoC] Some new ideas for this year's GSoC
From: Stefan (mstefanro_at_[hidden])
Date: 2010-04-08 19:16:41

Next message: Daniel Walker: "Re: [boost] decltype and incomplete types"
Previous message: Stefan: "Re: [boost] [GSoC] Some new ideas for this year's GSoC"
In reply to: Stefan: "Re: [boost] [GSoC] Some new ideas for this year's GSoC"
Next in thread: Stefan: "Re: [boost] [GSoC] Some new ideas for this year's GSoC"

Stefan wrote:
> Phil Endecott wrote:
>> Stefan wrote:
>>> * Optimize Boost String Algorithm finders to use an efficient
>>> substring matching algorithm as opposed to the naive algorithm. Most
>>> likely because of the fact that boost has implemented the finders for
>>> strings as generic algorithms, they haven't used efficient algorithms
>>> such as KMP, Boyer-Moore, Rabin-Karp etc.
>>> \see boost/algorithm/string/detail/finder.hpp
>>> The idea is to implement efficient algorithms such as the ones
>>> mentioned above to replace the current naive implementation. Note: it
>>> may be difficult to find nth matching substring position using some
>>> of these algorithms.
>>> Difficulty: easy-medium
>>>
>>> * Implement a string algorithm for generating the suffix array. The
>>> suffix array, after being precomputed on a certain string, allows for
>>> very efficient substring searches in the future. This is useful when
>>> the string doesn't change too often but there are plenty of substring
>>> to search against.
>>
>> Hi Stefan,
>>
>> Have you seen this thread, from a couple of years ago? :
>>
>> http://thread.gmane.org/gmane.comp.lib.boost.devel/171098
>>
>> I don't know what has changed in string_algo since then, but I think
>> there is probably plenty of opportunity for easy optimisation.
>>
>> Do you have a motivating application for this? A lot of this stuff is
>> rather data-dependant, so having a particular application that can
>> provide a benchmark is a good idea. I guess that a suffix array
>> library would be reasonable for a GSoC project, if you're good! You
>> might also like to consider variants that index at e.g. word
>> boundaries; I have used something like that and it works well for some
>> kinds of search.
>>
>>
>> Regards, Phil.
>>
>>
>>
>> _______________________________________________
>> Unsubscribe & other changes:
>> http://lists.boost.org/mailman/listinfo.cgi/boost
>>
> I've decided to implement Rabin-Karp today and after benchmarking I
> realized that on my pseudo-randomly generated data it performs worse
> than boost. My Rabin-Karp implementation will sure need some profiling
> before it can be used in benchmarking. Also some better data sets than
> what I came up with.
> Here it is:
>
> ...
> I doubt rabin-karp normally performs so badly, therefore my
> implementation definitely needs improvement.
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>

Sorry for double-posting, no idea why it happened. Also, one can neglect
the comments "//! @TODO: choose smaller primes (so that 2*prime fits in
uint32)" and "//! Heuristic specialization, omit testing character by
character." as they are incorrect. The former has been solved whereas
the latter is not a specialization, although was meant to be at one point

Next message: Daniel Walker: "Re: [boost] decltype and incomplete types"
Previous message: Stefan: "Re: [boost] [GSoC] Some new ideas for this year's GSoC"
In reply to: Stefan: "Re: [boost] [GSoC] Some new ideas for this year's GSoC"
Next in thread: Stefan: "Re: [boost] [GSoC] Some new ideas for this year's GSoC"

Date view	Thread view	Subject view	Author view

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk