Boost logo

Boost :

Subject: Re: [boost] [gsoc 2013] Approximate string matching
From: Jan Strnad (hanny.strnad_at_[hidden])
Date: 2013-04-18 06:40:57


Dne 18.4.2013 05:59, Michael Marcin napsal(a):
> On 4/17/2013 4:54 PM, Jan Strnad wrote:
>> Dne 17.4.2013 20:00, Michael Marcin napsal(a):
>>> On 4/17/2013 12:40 PM, Jan Strnad wrote:
>>>>
>>>> Finally, I would implement functions
>>>> template <typename TRange>
>>>> find_iterator<TRange> approximate_find
>>>> (TRange text, TRange pattern, approximate_distance<TRange>&
>>>> distance);
>>>>
>>>> template <typename TRange>
>>>> find_iterator<std::size_t> approximate_find_index
>>>> (TRange text, TRange pattern, approximate_distance<TRange>&
>>>> distance);
>>>>
>>>> These would return all approximate matches of pattern int text. The
>>>> first one return the approximate string itself, the later one returns
>>>> its index.
>>>>
>>>
>>> These functions are for finding substrings in the text?
>>>
>>> This seems to me like how the regex_search interface works. Is it
>>> possible to make the interface similar?
>>>
>>> Something like
>>>
>>> string pattern = "acbdef";
>>> string text = "aacdefbcdefabccef";
>>> boost::amatch m;
>>> boost::approx e( hamming_distance, pattern, 1 );
>>> while( boost::approx_search( text, m, e )
>>> {
>>> cout << m.str() << endl;
>>> text = m.suffix().str();
>>> }
>>> // prints aacdef, fbcdef abccef
>>>
>>>
>>> Although I could be seeing similarities where they don't exist.
>>
>> Well, the problem with this interface is that may not find all
>> occurrences of pattern. Consider text="ababa" and pattern "aba". There
>> are clearly two matches, the first starts at index 0 and the second at
>> index 2. Using the interface above, I only get the one at index 0.
>>
>> The point is that I can not just drop all matched characters. I may
>> still need them.
>>
>> So, I think the interface could look like this:
>> string pattern = "acbdef";
>> string text = "aacdefbcdefabccef";
>> boost::amatches m;
>> boost::approx e( hamming_distance, pattern, 1 );
>> boost::approx_search(text, m, e);
>> while(m.next())
>> {
>> cout << m.str() << endl;
>> }
>> // prints aacdef, fbcdef abccef
>>
>> Therefore, boost:approx_search should return an interator-like structure.
>>
>>
>
> I see.
>
> In that case try to use an iterator interface that plays more nicely
> with the STL. Maybe something closer to regex_iterator. Specifically
> look at the ForwardIterator concept.
>
> Maybe something like:
>
> string pattern = "acbdef";
> string text = "aacdefbcdefabccef";
> boost::approx_pattern ap( hamming_distance, pattern, 1 );
> boost::sapprox_match_iterator begin( text, ap );
> boost::sapprox_match_iterator end;
> std::for_each( begin, end,
> []( const boost::sapprox_match_result& what )
> {
> cout << what.str() << " starting at " << what.position() << endl;
> } );
>
This seems as the best choice to me right now.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk