Boost logo

Boost :

Subject: Re: [boost] [gsoc 2013] Approximate string matching
From: Jan Strnad (hanny.strnad_at_[hidden])
Date: 2013-04-16 11:54:18


Dne 15.4.2013 15:11, Adam D. Walling napsal(a):
> Jan Strnad <hanny.strnad <at> gmail.com> writes:
>
>>
>> Dne 12.4.2013 19:50, Michael Marcin napsal(a):
>>> On 4/12/13 4:52 AM, Jan Strnad wrote:
> ...
>>>> Now more specifically. I would like to implement approximate string
>>>> matching algorithm(s) based on the NFA. I would like to support various
>>>> approximate distances, such as Hamming distance [1], Levenstein
> distance
>>>> [2], Damerau-Levenstein distance [3], Delta distance, Gamma distance
> and
>>>> finally (delta, gamma) distance.
> ...
>>>
>>> Sounds very useful!
>>>
>>> I guess this would fit in the string algorithm library?
>>
>> That seems to me reasonable as well.
>>
>>>
>>> What would the interface look like?
>>>
>>> Is it as simple as:
>>>
>>> namespace boost {
>>> namespace algorithm {
>>>
>>> template< typename Range1T, typename Range2T >
>>> std::size_t levenstein_distance( Range1T first, Range2T second );
>>>
>>> }
>>> }
>>>
>>
>> Sure, that would be possible, but I was originally aiming at something
>> little bit different. Basically, I want to design an algorithm that
>> would find all (or just the first) occurrences of a given pattern with
>> given distance in some string.
>>
>> So to your question, simple interface could look like this:
>> namespace boost
>> {
>> namespace algorithm
>> {
>> template <typename T>
>> class approximate_distance;
>>
>> template <typename TRange>
>> find_iterator<TRange> approximate_find
>> (TRange text,
>> approximate_distance<TRange>& model);
>> }
>> }
> ...
>
> I would recommend starting off with the basic implementations of these
> distance algorithms if possible. There are several situations where I
> currently use my own Levenshtein distance implementation, and could use
> other alternative 'distance' measures, but the approximate_find is useless
> here.
>
> Of course it could be useful in other situations, but I would suggest
> exposing the underlying distance algorithms first, so they can then be
> adapted into further utilities like approximate_find or used directly as
> necessary.
>
> -Adam D. Walling

OK, I should definitely include these "word" distance functions as well.

Jan Strnad


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk