Boost logo

Boost :

Subject: Re: [boost] [gsoc-2013]Approximate string matching
From: Erik Erlandson (eje_at_[hidden])
Date: 2013-04-25 11:46:16


----- Original Message -----
> 2013/4/23 Erik Erlandson <eerlands_at_[hidden]>
> >
> > How one would design the interface for edit_distance<> is an interesting
> > question. In the basic general case, edit_distance<> is a function of two
> > sequences, and a cost matrix for insertion/deletion/substitution/equality.
> > In boost/template world, the "equality" test would of course be a template
> > parameter, with the expected default. These days, I'd expect to be able to
> > provide sequences as either begin/end iterators, or ranges. The cost matrix
> > could be either provided as numeric values or functionals, with defaults
> > to: 1/1/1/0.
> >
>
> Thanks for your advice! I'm working on writing a proposal. Looking forward
> to your further advice on my proposal.
>

A few more thoughts:

*) Cost functionals should take parameters, probably something like:

struct substitution_cost {
    double operator()(const T& a, const T& b) const {
        // return some distance metric between a and b that represents
        // cost of substitution. In the classic case, this just returns 1
        // if a != b, and zero if a == b.
    }
}

struct insertion_cost {
    double operator()(const T& a) const {
        // cost of deleting element (a), classic case is also just 1.
    }
}

*) Using the above framework, substitution and equality are really the same. For example, if you are substituting (a) for (b), and it so happens that a == b, then the substitution cost may very well return as zero (but it's up to you!). So really there are only three edit operations: insertion, deletion and substitution, where substituting two equal elements will commonly be defined to return zero for the cost.

*) Also notice that in this framework, you don't need to supply a separate equality operator, since that logic is now subsumed by substitution cost functional.

*) I can think of three plausible ways to supply the edit operation cost-functionals. The first is of course just three template parameters. This might be slightly inelegant if you want to use defaults for the first two, but alter the 3rd.

The second is to supply a single object that provides all three:

struct costs {
    double insertion(const T& a) const { ... }
    double deletion(const T& a) const { ... }
    double substitution(const T& a, const T& b) const { ... }
}

The third is to use a keyword argument scheme. Boost provides this, or at least it did in the past. The one time I tried it, I personally didn't care for it, but that was several years ago and if it's widely used in the community then it should probably be considered. I'd defer to community opinions.

*) Edit distance can either just return the distance and nothing else, or it can optionally return the actual sequence of edit operations and op-costs that yielded the distance measure. Returning just the cost allows for more internal efficiencies, but it would probably be useful to supply variations that also return the actual edit sequence.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk