Boost logo

Boost :

Subject: Re: [boost] [GSoC] Some new ideas for this year's GSoC
From: Stefan (mstefanro_at_[hidden])
Date: 2010-04-08 19:05:12


Phil Endecott wrote:
> Stefan wrote:
>> * Optimize Boost String Algorithm finders to use an efficient
>> substring matching algorithm as opposed to the naive algorithm. Most
>> likely because of the fact that boost has implemented the finders for
>> strings as generic algorithms, they haven't used efficient algorithms
>> such as KMP, Boyer-Moore, Rabin-Karp etc.
>> \see boost/algorithm/string/detail/finder.hpp
>> The idea is to implement efficient algorithms such as the ones
>> mentioned above to replace the current naive implementation. Note: it
>> may be difficult to find nth matching substring position using some of
>> these algorithms.
>> Difficulty: easy-medium
>>
>> * Implement a string algorithm for generating the suffix array. The
>> suffix array, after being precomputed on a certain string, allows for
>> very efficient substring searches in the future. This is useful when
>> the string doesn't change too often but there are plenty of substring
>> to search against.
>
> Hi Stefan,
>
> Have you seen this thread, from a couple of years ago? :
>
> http://thread.gmane.org/gmane.comp.lib.boost.devel/171098
>
> I don't know what has changed in string_algo since then, but I think
> there is probably plenty of opportunity for easy optimisation.
>
> Do you have a motivating application for this? A lot of this stuff is
> rather data-dependant, so having a particular application that can
> provide a benchmark is a good idea. I guess that a suffix array library
> would be reasonable for a GSoC project, if you're good! You might also
> like to consider variants that index at e.g. word boundaries; I have
> used something like that and it works well for some kinds of search.
>
>
> Regards, Phil.
>
>
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>
I've decided to implement Rabin-Karp today and after benchmarking I
realized that on my pseudo-randomly generated data it performs worse
than boost. My Rabin-Karp implementation will sure need some profiling
before it can be used in benchmarking. Also some better data sets than
what I came up with.
Here it is:

/** Rabin-karp
  *
  * \author mstefanro_at_[hidden]
  */

#include <boost/cstdint.hpp>
#include <vector>
#include <utility>
#include <algorithm>
#include <boost/utility.hpp>
#include <boost/tuple/tuple.hpp>
#include <cassert>

//Note: I've wasted a few hours debugging because I initially haven't
used the uintmax_t for logpow().

//! Binary exponentiation in O(log exp). This should also be added to
boost unless it already exists and I'm not aware of it
//! Very quickly computes (base^exp)%mod
template <class Uint>
Uint logpow (Uint base, Uint exp, Uint mod)
{
        boost::uintmax_t ret=1, exponential_base = base;
        for (; exp; exp >>= 1)
        {
                if (exp&1) ret = (ret*exponential_base)%mod;
                exponential_base = (exponential_base*exponential_base)%mod;
        }
        return static_cast<Uint>(ret);
}

//! @TODO: add support for Container=char[],char*,wchar_t[],wchar_t*
//! @TODO: accept generic matches container rather than just std::vector
in find()
//! @TODO: choose smaller primes (so that 2*prime fits in uint32)

template <
        class Container, bool Heuristic = false, class
Comparator=std::equal_to<Container::value_type>,
        class HashType = boost::uint32_t,
        HashType FirstBase=257u, HashType SecondBase=277u,
        HashType FirstModulo=7624679u, HashType SecondModulo=7116647u
>
class rabin_karp
{
public:
        rabin_karp (Container const &substring)
                : substring_(substring), first_hash(0u), second_hash(0u), comparator()
        {
                if (!substring.empty())
                {
                        boost::tie(first_hash, second_hash) =
                                compute_hashes(substring.begin(), substring.end());

                        first_additive_inverse =
                                FirstModulo - ::logpow(FirstBase ,
static_cast<HashType>(substring.size()-1), FirstModulo );
                        second_additive_inverse =
                                SecondModulo - ::logpow(SecondBase,
static_cast<HashType>(substring.size()-1), SecondModulo);
                }
        }
        template <class Container2>
        bool find(
                Container2 const &string,
                std::vector<typename Container2::const_iterator> &matches,
                unsigned int stopAt=0)
        {
                HashType current_first_hash, current_second_hash;
                bool found = false;
                
                matches.clear();
                if (string.size() < substring_.size() || substring_.empty()) return false;
                
                boost::tie(current_first_hash, current_second_hash) =
compute_hashes(string.begin(), string.begin()+substring_.size());
                Container2::const_iterator iter = string.begin()+substring_.length()-1;
                do {
                        ++iter;
                        if (first_hash == current_first_hash &&
                                second_hash == current_second_hash &&
                                test_for_match<Heuristic>(iter-substring_.length(), iter)
                                )
                        {
                                // Got a match
                                matches.push_back(iter-substring_.length());
                                found = true;
                                if (stopAt) { --stopAt; if (!stopAt) break; }
                        }

                        if (iter == string.end()) break;
                        else
                        {
                                // we've got a new character, update the hashes
                                // remove *(iter-substring.length())
                                // add *iter
                                HashType remove = *(iter-substring_.length()), add = *iter;
                                current_first_hash = ( (current_first_hash +
(remove*first_additive_inverse)%FirstModulo)*FirstBase ) % FirstModulo;
                                current_first_hash = (current_first_hash + add)%FirstModulo;
                                
                                current_second_hash = ( (current_second_hash +
(remove*second_additive_inverse)%SecondModulo)*SecondBase ) % SecondModulo;
                                current_second_hash = (current_second_hash + add)%SecondModulo;
                        }
                } while (true);
                return found;
        }
private:
        Container substring_;
        Comparator comparator;
        HashType first_hash, second_hash;
        HashType first_additive_inverse, second_additive_inverse;

        template <class Iterator>
        std::pair<HashType, HashType> compute_hashes (Iterator begin, Iterator end)
        {
                HashType first=0u, second=0u;
                for (Iterator iter = begin; iter != end; ++iter)
                {
                        first = (first * FirstBase + *iter) % FirstModulo ;
                        second = (second * SecondBase + *iter) % SecondModulo;
                }
                return std::make_pair(first, second);
        }

        //! Check if [begin,end) == [ substring_.begin(),substring_.end() )
        template <bool Heuristic_, class Iterator>
        typename boost::enable_if_c<Heuristic_ == false,bool>::type
test_for_match(Iterator begin, Iterator end)
        {
                Iterator iter2 = begin;
                assert(end-begin == substring_.length());
                for (Container::const_iterator iter=substring_.begin();
                                iter != substring_.end(); ++iter, ++iter2)
                {
                        if (!comparator(*iter, *iter2)) return false;
                }
                return true;
        }

        //! Heuristic specialization, omit testing character by character.
        //! One should hope for no collisions on both hash functions at once
(highly unlikely)
        //! If it's seen happening, add a third hash function?
        template <bool Heuristic_, class Iterator>
        typename boost::enable_if_c<Heuristic_ == true,bool>::type
test_for_match(Iterator begin, Iterator end)
        {
                return true;
        }
};

There's both an heuristic and a deterministic algorithm, but the
heuristic one has never failed me so far (tested on around 20 mil inputs
the heuristic implementation has always yielded the correct results).
Typical usage is:
  std::vector<std::string::const_iterator> matches;
  rabin_karp<std::string> rk("my substring");
  if (rk.find(std::string("my string which contains my substring"),
matches)) { /* ... */ }
Or for the heuristic version:
  ...
  rabin_karp<std::string, true> rk("my substring");
  ...

I doubt rabin-karp normally performs so badly, therefore my
implementation definitely needs improvement.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk