Boost logo

Boost :

Subject: Re: [boost] [gsoc 2013] Approximate string matching
From: Jan Strnad (hanny.strnad_at_[hidden])
Date: 2013-04-17 13:40:13


So, to sum it up:
I would like to implement following approximate distance functions as an
extension of Boost String Algorithm library:

Hamming distance
Levenstein distance
Generalized Levenstein (weighted Levenstein) distance
Damerau–Levenshtein distance
Weighted Damerau–Levenshtein distance
Jaro distance
Jaro-Winkler distance

For an ordered alphabet, I would also implement
Delta distance
Gamma distance

The interface for most distances would look like:

template< typename Range1T, typename Range2T >
std::size_t hamming_distance( Range1T first, Range2T second );

Jaro and Jaro-Winkler distances have following interface (return float
instead of size_t):

template< typename Range1T, typename Range2T >
float hamming_distance( Range1T first, Range2T second );

Finally, I would implement functions
template <typename TRange>
find_iterator<TRange> approximate_find
  (TRange text, TRange pattern, approximate_distance<TRange>& distance);

template <typename TRange>
find_iterator<std::size_t> approximate_find_index
  (TRange text, TRange pattern, approximate_distance<TRange>& distance);

These would return all approximate matches of pattern int text. The
first one return the approximate string itself, the later one returns
its index.

Do you have any further suggestions? Is there anybody willing to mentor
this project?

Jan Strnad


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk