Boost logo

Boost :

Subject: Re: [boost] [gsoc 2013] Approximate string matching
From: Jan Strnad (hanny.strnad_at_[hidden])
Date: 2013-04-12 15:15:13

Dne 12.4.2013 19:50, Michael Marcin napsal(a):
> On 4/12/13 4:52 AM, Jan Strnad wrote:
>> Hi, I'm Jan Strnad, currently a Master's student at CTU in Prague, Czech
>> Republic. This summer, I would like to participate in GSOC and I believe
>> that approximate string matching is a good match for me, since my
>> background i this area is quite strong.
>> Now more specifically. I would like to implement approximate string
>> matching algorithm(s) based on the NFA. I would like to support various
>> approximate distances, such as Hamming distance [1], Levenstein distance
>> [2], Damerau-Levenstein distance [3], Delta distance, Gamma distance and
>> finally (delta, gamma) distance. Sorry no explanation of the last three
>> found on the Web, but I can provide one if interested.
>> In the matter of NFA implementation, I would probably choose dynamic
>> programming approach, but I'm thinking of shift-or algorithm as well.
>> Can I ask you for any kind of thoughts or feedback regarding this topic?
>> Best regards,
>> Jan Strnad
>> [1]
>> [2]
>> [3]
> Sounds very useful!
> I guess this would fit in the string algorithm library?

That seems to me reasonable as well.

> What would the interface look like?
> Is it as simple as:
> namespace boost {
> namespace algorithm {
> template< typename Range1T, typename Range2T >
> std::size_t levenstein_distance( Range1T first, Range2T second );
> }
> }

Sure, that would be possible, but I was originally aiming at something
little bit different. Basically, I want to design an algorithm that
would find all (or just the first) occurrences of a given pattern with
given distance in some string.

So to your question, simple interface could look like this:
namespace boost
   namespace algorithm
           template <typename T>
           class approximate_distance;

     template <typename TRange>
     find_iterator<TRange> approximate_find
        (TRange text,
          approximate_distance<TRange>& model);

And a simple usage:

string pattern = "acbdef";
approximate_distance<string>* model =
   new hamming_distance<string>(pattern, 1);
find_iterator<string> result = approximate_find("aacdefbcdefabccef");
   cout << * result << endl;
// prints aacdef, fbcdef and abccef

Boost list run by bdawes at, gregod at, cpdaniel at, john at