Boost logo

Boost :

Subject: Re: [boost] [gsoc 2013] Approximate string matching
From: Michael Marcin (mike.marcin_at_[hidden])
Date: 2013-04-17 23:59:12


On 4/17/2013 4:54 PM, Jan Strnad wrote:
> Dne 17.4.2013 20:00, Michael Marcin napsal(a):
>> On 4/17/2013 12:40 PM, Jan Strnad wrote:
>>>
>>> Finally, I would implement functions
>>> template <typename TRange>
>>> find_iterator<TRange> approximate_find
>>> (TRange text, TRange pattern, approximate_distance<TRange>&
>>> distance);
>>>
>>> template <typename TRange>
>>> find_iterator<std::size_t> approximate_find_index
>>> (TRange text, TRange pattern, approximate_distance<TRange>&
>>> distance);
>>>
>>> These would return all approximate matches of pattern int text. The
>>> first one return the approximate string itself, the later one returns
>>> its index.
>>>
>>
>> These functions are for finding substrings in the text?
>>
>> This seems to me like how the regex_search interface works. Is it
>> possible to make the interface similar?
>>
>> Something like
>>
>> string pattern = "acbdef";
>> string text = "aacdefbcdefabccef";
>> boost::amatch m;
>> boost::approx e( hamming_distance, pattern, 1 );
>> while( boost::approx_search( text, m, e )
>> {
>> cout << m.str() << endl;
>> text = m.suffix().str();
>> }
>> // prints aacdef, fbcdef abccef
>>
>>
>> Although I could be seeing similarities where they don't exist.
>
> Well, the problem with this interface is that may not find all
> occurrences of pattern. Consider text="ababa" and pattern "aba". There
> are clearly two matches, the first starts at index 0 and the second at
> index 2. Using the interface above, I only get the one at index 0.
>
> The point is that I can not just drop all matched characters. I may
> still need them.
>
> So, I think the interface could look like this:
> string pattern = "acbdef";
> string text = "aacdefbcdefabccef";
> boost::amatches m;
> boost::approx e( hamming_distance, pattern, 1 );
> boost::approx_search(text, m, e);
> while(m.next())
> {
> cout << m.str() << endl;
> }
> // prints aacdef, fbcdef abccef
>
> Therefore, boost:approx_search should return an interator-like structure.
>
>

I see.

In that case try to use an iterator interface that plays more nicely
with the STL. Maybe something closer to regex_iterator. Specifically
look at the ForwardIterator concept.

Maybe something like:

string pattern = "acbdef";
string text = "aacdefbcdefabccef";
boost::approx_pattern ap( hamming_distance, pattern, 1 );
boost::sapprox_match_iterator begin( text, ap );
boost::sapprox_match_iterator end;
std::for_each( begin, end,
[]( const boost::sapprox_match_result& what )
{
     cout << what.str() << " starting at " << what.position() << endl;
} );


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk