Boost :

Date view	Thread view	Subject view	Author view

Subject: Re: [boost] [GSoC] Some new ideas for this year's GSoC
From: Stefan (mstefanro_at_[hidden])
Date: 2010-04-08 19:05:12

Next message: Stefan: "Re: [boost] [GSoC] Some new ideas for this year's GSoC"
Previous message: Davi Barbosa: "[boost] [GSoC] - Graph.boost - Graph Connectives"
In reply to: Phil Endecott: "Re: [boost] [GSoC] Some new ideas for this year's GSoC"
Next in thread: Stefan: "Re: [boost] [GSoC] Some new ideas for this year's GSoC"
Reply: Stefan: "Re: [boost] [GSoC] Some new ideas for this year's GSoC"

Phil Endecott wrote:
> Stefan wrote:
>> * Optimize Boost String Algorithm finders to use an efficient
>> substring matching algorithm as opposed to the naive algorithm. Most
>> likely because of the fact that boost has implemented the finders for
>> strings as generic algorithms, they haven't used efficient algorithms
>> such as KMP, Boyer-Moore, Rabin-Karp etc.
>> \see boost/algorithm/string/detail/finder.hpp
>> The idea is to implement efficient algorithms such as the ones
>> mentioned above to replace the current naive implementation. Note: it
>> may be difficult to find nth matching substring position using some of
>> these algorithms.
>> Difficulty: easy-medium
>>
>> * Implement a string algorithm for generating the suffix array. The
>> suffix array, after being precomputed on a certain string, allows for
>> very efficient substring searches in the future. This is useful when
>> the string doesn't change too often but there are plenty of substring
>> to search against.
>
> Hi Stefan,
>
> Have you seen this thread, from a couple of years ago? :
>
> http://thread.gmane.org/gmane.comp.lib.boost.devel/171098
>
> I don't know what has changed in string_algo since then, but I think
> there is probably plenty of opportunity for easy optimisation.
>
> Do you have a motivating application for this? A lot of this stuff is
> rather data-dependant, so having a particular application that can
> provide a benchmark is a good idea. I guess that a suffix array library
> would be reasonable for a GSoC project, if you're good! You might also
> like to consider variants that index at e.g. word boundaries; I have
> used something like that and it works well for some kinds of search.
>
>
> Regards, Phil.
>
>
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>
I've decided to implement Rabin-Karp today and after benchmarking I
realized that on my pseudo-randomly generated data it performs worse
than boost. My Rabin-Karp implementation will sure need some profiling
before it can be used in benchmarking. Also some better data sets than
what I came up with.
Here it is:

/** Rabin-karp
  *
  * \author mstefanro_at_[hidden]
  */

#include <boost/cstdint.hpp>
#include <vector>
#include <utility>
#include <algorithm>
#include <boost/utility.hpp>
#include <boost/tuple/tuple.hpp>
#include <cassert>

//Note: I've wasted a few hours debugging because I initially haven't
used the uintmax_t for logpow().

//! Binary exponentiation in O(log exp). This should also be added to
boost unless it already exists and I'm not aware of it
//! Very quickly computes (base^exp)%mod
template <class Uint>
Uint logpow (Uint base, Uint exp, Uint mod)
{
        boost::uintmax_t ret=1, exponential_base = base;
        for (; exp; exp >>= 1)
        {
                if (exp&1) ret = (ret*exponential_base)%mod;
                exponential_base = (exponential_base*exponential_base)%mod;
        }
        return static_cast<Uint>(ret);
}

//! @TODO: add support for Container=char[],char*,wchar_t[],wchar_t*
//! @TODO: accept generic matches container rather than just std::vector
in find()
//! @TODO: choose smaller primes (so that 2*prime fits in uint32)

template <
        class Container, bool Heuristic = false, class
Comparator=std::equal_to<Container::value_type>,
        class HashType = boost::uint32_t,
        HashType FirstBase=257u, HashType SecondBase=277u,
        HashType FirstModulo=7624679u, HashType SecondModulo=7116647u
>
class rabin_karp
{
public:
        rabin_karp (Container const &substring)
                : substring_(substring), first_hash(0u), second_hash(0u), comparator()
        {
                if (!substring.empty())
                {
                        boost::tie(first_hash, second_hash) =
                                compute_hashes(substring.begin(), substring.end());

                        first_additive_inverse =
                                FirstModulo - ::logpow(FirstBase ,
static_cast<HashType>(substring.size()-1), FirstModulo );
                        second_additive_inverse =
                                SecondModulo - ::logpow(SecondBase,
static_cast<HashType>(substring.size()-1), SecondModulo);
                }
        }
        template <class Container2>
        bool find(
                Container2 const &string,
                std::vector<typename Container2::const_iterator> &matches,
                unsigned int stopAt=0)
        {
                HashType current_first_hash, current_second_hash;
                bool found = false;

                matches.clear();
                if (string.size() < substring_.size() || substring_.empty()) return false;

                boost::tie(current_first_hash, current_second_hash) =
compute_hashes(string.begin(), string.begin()+substring_.size());
                Container2::const_iterator iter = string.begin()+substring_.length()-1;
                do {
                        ++iter;
                        if (first_hash == current_first_hash &&
                                second_hash == current_second_hash &&
                                test_for_match<Heuristic>(iter-substring_.length(), iter)
                                )
                        {
                                // Got a match
                                matches.push_back(iter-substring_.length());
                                found = true;
                                if (stopAt) { --stopAt; if (!stopAt) break; }
                        }

                        if (iter == string.end()) break;
                        else
                        {
                                // we've got a new character, update the hashes
                                // remove *(iter-substring.length())
                                // add *iter
                                HashType remove = *(iter-substring_.length()), add = *iter;
                                current_first_hash = ( (current_first_hash +
(remove*first_additive_inverse)%FirstModulo)*FirstBase ) % FirstModulo;
                                current_first_hash = (current_first_hash + add)%FirstModulo;

                                current_second_hash = ( (current_second_hash +
(remove*second_additive_inverse)%SecondModulo)*SecondBase ) % SecondModulo;
                                current_second_hash = (current_second_hash + add)%SecondModulo;
                        }
                } while (true);
                return found;
        }
private:
        Container substring_;
        Comparator comparator;
        HashType first_hash, second_hash;
        HashType first_additive_inverse, second_additive_inverse;

        template <class Iterator>
        std::pair<HashType, HashType> compute_hashes (Iterator begin, Iterator end)
        {
                HashType first=0u, second=0u;
                for (Iterator iter = begin; iter != end; ++iter)
                {
                        first = (first * FirstBase + *iter) % FirstModulo ;
                        second = (second * SecondBase + *iter) % SecondModulo;
                }
                return std::make_pair(first, second);
        }

        //! Check if [begin,end) == [ substring_.begin(),substring_.end() )
        template <bool Heuristic_, class Iterator>
        typename boost::enable_if_c<Heuristic_ == false,bool>::type
test_for_match(Iterator begin, Iterator end)
        {
                Iterator iter2 = begin;
                assert(end-begin == substring_.length());
                for (Container::const_iterator iter=substring_.begin();
                                iter != substring_.end(); ++iter, ++iter2)
                {
                        if (!comparator(*iter, *iter2)) return false;
                }
                return true;
        }

        //! Heuristic specialization, omit testing character by character.
        //! One should hope for no collisions on both hash functions at once
(highly unlikely)
        //! If it's seen happening, add a third hash function?
        template <bool Heuristic_, class Iterator>
        typename boost::enable_if_c<Heuristic_ == true,bool>::type
test_for_match(Iterator begin, Iterator end)
        {
                return true;
        }
};

There's both an heuristic and a deterministic algorithm, but the
heuristic one has never failed me so far (tested on around 20 mil inputs
the heuristic implementation has always yielded the correct results).
Typical usage is:
  std::vector<std::string::const_iterator> matches;
  rabin_karp<std::string> rk("my substring");
  if (rk.find(std::string("my string which contains my substring"),
matches)) { /* ... */ }
Or for the heuristic version:
  ...
  rabin_karp<std::string, true> rk("my substring");
  ...

I doubt rabin-karp normally performs so badly, therefore my
implementation definitely needs improvement.

Next message: Stefan: "Re: [boost] [GSoC] Some new ideas for this year's GSoC"
Previous message: Davi Barbosa: "[boost] [GSoC] - Graph.boost - Graph Connectives"
In reply to: Phil Endecott: "Re: [boost] [GSoC] Some new ideas for this year's GSoC"
Next in thread: Stefan: "Re: [boost] [GSoC] Some new ideas for this year's GSoC"
Reply: Stefan: "Re: [boost] [GSoC] Some new ideas for this year's GSoC"

Date view	Thread view	Subject view	Author view

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk