Boost logo

Boost :

Subject: Re: [boost] [gsoc 2013] Approximate string matching
From: Jan Strnad (hanny.strnad_at_[hidden])
Date: 2013-04-17 17:54:22


Dne 17.4.2013 20:00, Michael Marcin napsal(a):
> On 4/17/2013 12:40 PM, Jan Strnad wrote:
>>
>> Finally, I would implement functions
>> template <typename TRange>
>> find_iterator<TRange> approximate_find
>> (TRange text, TRange pattern, approximate_distance<TRange>& distance);
>>
>> template <typename TRange>
>> find_iterator<std::size_t> approximate_find_index
>> (TRange text, TRange pattern, approximate_distance<TRange>& distance);
>>
>> These would return all approximate matches of pattern int text. The
>> first one return the approximate string itself, the later one returns
>> its index.
>>
>
> These functions are for finding substrings in the text?
>
> This seems to me like how the regex_search interface works. Is it
> possible to make the interface similar?
>
> Something like
>
> string pattern = "acbdef";
> string text = "aacdefbcdefabccef";
> boost::amatch m;
> boost::approx e( hamming_distance, pattern, 1 );
> while( boost::approx_search( text, m, e )
> {
> cout << m.str() << endl;
> text = m.suffix().str();
> }
> // prints aacdef, fbcdef abccef
>
>
> Although I could be seeing similarities where they don't exist.

Well, the problem with this interface is that may not find all
occurrences of pattern. Consider text="ababa" and pattern "aba". There
are clearly two matches, the first starts at index 0 and the second at
index 2. Using the interface above, I only get the one at index 0.

The point is that I can not just drop all matched characters. I may
still need them.

So, I think the interface could look like this:
string pattern = "acbdef";
string text = "aacdefbcdefabccef";
boost::amatches m;
boost::approx e( hamming_distance, pattern, 1 );
boost::approx_search(text, m, e);
while(m.next())
{
    cout << m.str() << endl;
}
// prints aacdef, fbcdef abccef

Therefore, boost:approx_search should return an interator-like structure.

Jan Strnad


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk