Boost logo

Boost :

Subject: Re: [boost] [gsoc 2013] Approximate string matching
From: Jan Strnad (hanny.strnad_at_[hidden])
Date: 2013-04-12 15:15:13


Dne 12.4.2013 19:50, Michael Marcin napsal(a):
> On 4/12/13 4:52 AM, Jan Strnad wrote:
>> Hi, I'm Jan Strnad, currently a Master's student at CTU in Prague, Czech
>> Republic. This summer, I would like to participate in GSOC and I believe
>> that approximate string matching is a good match for me, since my
>> background i this area is quite strong.
>>
>> Now more specifically. I would like to implement approximate string
>> matching algorithm(s) based on the NFA. I would like to support various
>> approximate distances, such as Hamming distance [1], Levenstein distance
>> [2], Damerau-Levenstein distance [3], Delta distance, Gamma distance and
>> finally (delta, gamma) distance. Sorry no explanation of the last three
>> found on the Web, but I can provide one if interested.
>>
>> In the matter of NFA implementation, I would probably choose dynamic
>> programming approach, but I'm thinking of shift-or algorithm as well.
>>
>> Can I ask you for any kind of thoughts or feedback regarding this topic?
>>
>> Best regards,
>> Jan Strnad
>>
>> [1] http://en.wikipedia.org/wiki/Hamming_distance
>> [2] http://en.wikipedia.org/wiki/Levenstein_distance
>> [3] http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
>>
>
> Sounds very useful!
>
> I guess this would fit in the string algorithm library?

That seems to me reasonable as well.

>
> What would the interface look like?
>
> Is it as simple as:
>
> namespace boost {
> namespace algorithm {
>
> template< typename Range1T, typename Range2T >
> std::size_t levenstein_distance( Range1T first, Range2T second );
>
> }
> }
>

Sure, that would be possible, but I was originally aiming at something
little bit different. Basically, I want to design an algorithm that
would find all (or just the first) occurrences of a given pattern with
given distance in some string.

So to your question, simple interface could look like this:
namespace boost
{
   namespace algorithm
   {
           template <typename T>
           class approximate_distance;

     template <typename TRange>
     find_iterator<TRange> approximate_find
        (TRange text,
          approximate_distance<TRange>& model);
   }
}

And a simple usage:

string pattern = "acbdef";
approximate_distance<string>* model =
   new hamming_distance<string>(pattern, 1);
find_iterator<string> result = approximate_find("aacdefbcdefabccef");
while(result())
{
   cout << * result << endl;
   ++result;
}
// prints aacdef, fbcdef and abccef


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk