|
Boost : |
Subject: Re: [boost] [GSoC] Some new ideas for this year's GSoC
From: Stefan (mstefanro_at_[hidden])
Date: 2010-04-08 19:05:12
Phil Endecott wrote:
> Stefan wrote:
>> * Optimize Boost String Algorithm finders to use an efficient
>> substring matching algorithm as opposed to the naive algorithm. Most
>> likely because of the fact that boost has implemented the finders for
>> strings as generic algorithms, they haven't used efficient algorithms
>> such as KMP, Boyer-Moore, Rabin-Karp etc.
>> \see boost/algorithm/string/detail/finder.hpp
>> The idea is to implement efficient algorithms such as the ones
>> mentioned above to replace the current naive implementation. Note: it
>> may be difficult to find nth matching substring position using some of
>> these algorithms.
>> Difficulty: easy-medium
>>
>> * Implement a string algorithm for generating the suffix array. The
>> suffix array, after being precomputed on a certain string, allows for
>> very efficient substring searches in the future. This is useful when
>> the string doesn't change too often but there are plenty of substring
>> to search against.
>
> Hi Stefan,
>
> Have you seen this thread, from a couple of years ago? :
>
> http://thread.gmane.org/gmane.comp.lib.boost.devel/171098
>
> I don't know what has changed in string_algo since then, but I think
> there is probably plenty of opportunity for easy optimisation.
>
> Do you have a motivating application for this? A lot of this stuff is
> rather data-dependant, so having a particular application that can
> provide a benchmark is a good idea. I guess that a suffix array library
> would be reasonable for a GSoC project, if you're good! You might also
> like to consider variants that index at e.g. word boundaries; I have
> used something like that and it works well for some kinds of search.
>
>
> Regards, Phil.
>
>
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>
I've decided to implement Rabin-Karp today and after benchmarking I
realized that on my pseudo-randomly generated data it performs worse
than boost. My Rabin-Karp implementation will sure need some profiling
before it can be used in benchmarking. Also some better data sets than
what I came up with.
Here it is:
/** Rabin-karp
*
* \author mstefanro_at_[hidden]
*/
#include <boost/cstdint.hpp>
#include <vector>
#include <utility>
#include <algorithm>
#include <boost/utility.hpp>
#include <boost/tuple/tuple.hpp>
#include <cassert>
//Note: I've wasted a few hours debugging because I initially haven't
used the uintmax_t for logpow().
//! Binary exponentiation in O(log exp). This should also be added to
boost unless it already exists and I'm not aware of it
//! Very quickly computes (base^exp)%mod
template <class Uint>
Uint logpow (Uint base, Uint exp, Uint mod)
{
boost::uintmax_t ret=1, exponential_base = base;
for (; exp; exp >>= 1)
{
if (exp&1) ret = (ret*exponential_base)%mod;
exponential_base = (exponential_base*exponential_base)%mod;
}
return static_cast<Uint>(ret);
}
//! @TODO: add support for Container=char[],char*,wchar_t[],wchar_t*
//! @TODO: accept generic matches container rather than just std::vector
in find()
//! @TODO: choose smaller primes (so that 2*prime fits in uint32)
template <
class Container, bool Heuristic = false, class
Comparator=std::equal_to<Container::value_type>,
class HashType = boost::uint32_t,
HashType FirstBase=257u, HashType SecondBase=277u,
HashType FirstModulo=7624679u, HashType SecondModulo=7116647u
>
class rabin_karp
{
public:
rabin_karp (Container const &substring)
: substring_(substring), first_hash(0u), second_hash(0u), comparator()
{
if (!substring.empty())
{
boost::tie(first_hash, second_hash) =
compute_hashes(substring.begin(), substring.end());
first_additive_inverse =
FirstModulo - ::logpow(FirstBase ,
static_cast<HashType>(substring.size()-1), FirstModulo );
second_additive_inverse =
SecondModulo - ::logpow(SecondBase,
static_cast<HashType>(substring.size()-1), SecondModulo);
}
}
template <class Container2>
bool find(
Container2 const &string,
std::vector<typename Container2::const_iterator> &matches,
unsigned int stopAt=0)
{
HashType current_first_hash, current_second_hash;
bool found = false;
matches.clear();
if (string.size() < substring_.size() || substring_.empty()) return false;
boost::tie(current_first_hash, current_second_hash) =
compute_hashes(string.begin(), string.begin()+substring_.size());
Container2::const_iterator iter = string.begin()+substring_.length()-1;
do {
++iter;
if (first_hash == current_first_hash &&
second_hash == current_second_hash &&
test_for_match<Heuristic>(iter-substring_.length(), iter)
)
{
// Got a match
matches.push_back(iter-substring_.length());
found = true;
if (stopAt) { --stopAt; if (!stopAt) break; }
}
if (iter == string.end()) break;
else
{
// we've got a new character, update the hashes
// remove *(iter-substring.length())
// add *iter
HashType remove = *(iter-substring_.length()), add = *iter;
current_first_hash = ( (current_first_hash +
(remove*first_additive_inverse)%FirstModulo)*FirstBase ) % FirstModulo;
current_first_hash = (current_first_hash + add)%FirstModulo;
current_second_hash = ( (current_second_hash +
(remove*second_additive_inverse)%SecondModulo)*SecondBase ) % SecondModulo;
current_second_hash = (current_second_hash + add)%SecondModulo;
}
} while (true);
return found;
}
private:
Container substring_;
Comparator comparator;
HashType first_hash, second_hash;
HashType first_additive_inverse, second_additive_inverse;
template <class Iterator>
std::pair<HashType, HashType> compute_hashes (Iterator begin, Iterator end)
{
HashType first=0u, second=0u;
for (Iterator iter = begin; iter != end; ++iter)
{
first = (first * FirstBase + *iter) % FirstModulo ;
second = (second * SecondBase + *iter) % SecondModulo;
}
return std::make_pair(first, second);
}
//! Check if [begin,end) == [ substring_.begin(),substring_.end() )
template <bool Heuristic_, class Iterator>
typename boost::enable_if_c<Heuristic_ == false,bool>::type
test_for_match(Iterator begin, Iterator end)
{
Iterator iter2 = begin;
assert(end-begin == substring_.length());
for (Container::const_iterator iter=substring_.begin();
iter != substring_.end(); ++iter, ++iter2)
{
if (!comparator(*iter, *iter2)) return false;
}
return true;
}
//! Heuristic specialization, omit testing character by character.
//! One should hope for no collisions on both hash functions at once
(highly unlikely)
//! If it's seen happening, add a third hash function?
template <bool Heuristic_, class Iterator>
typename boost::enable_if_c<Heuristic_ == true,bool>::type
test_for_match(Iterator begin, Iterator end)
{
return true;
}
};
There's both an heuristic and a deterministic algorithm, but the
heuristic one has never failed me so far (tested on around 20 mil inputs
the heuristic implementation has always yielded the correct results).
Typical usage is:
std::vector<std::string::const_iterator> matches;
rabin_karp<std::string> rk("my substring");
if (rk.find(std::string("my string which contains my substring"),
matches)) { /* ... */ }
Or for the heuristic version:
...
rabin_karp<std::string, true> rk("my substring");
...
I doubt rabin-karp normally performs so badly, therefore my
implementation definitely needs improvement.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk