Boost logo

Boost :

Subject: Re: [boost] [gsoc-2013] proposal for approximate string matching
From: Yu Jiang (sunlightt07_at_[hidden])
Date: 2013-04-28 02:13:07


2013/4/28 Erik Erlandson <eerlands_at_[hidden]>

>
>
> > > I agree to modify the return type for function (4). So now I think your
> > suggestion for using a class to supply
> > these 3 functions together is a better choice. In case (1)(2)(3), I think
> > return size_type seems to be better since
> > the algorithm itself can only output an integer as the result?
>
> I think in most cases the result would be integer. I did once implement
> a variation where I used (1.5) for the cost of substitution, and (1.0) for
> insertion/deletion. I no longer remember the details, but I do recall that
> using (2.0) didn't work as well, it wanted to be (1.5). At any rate, the
> resulting distance in such a case would be floating point, not integer.
> If it's reasonably easy to support configurable compute/return types,
> including non-integers, I think it might prove useful.
>
>
Yes, I totally agree with you. In case (1)(2)(3), there isn't a parameter
for user to change the default cost(1/1/1/0).
So maybe return an integer value is enough. In case (4), the user can
change the return type if he likes.

> I know that you are going to be focusing on actual strings for your
> applications, but there are other kinds of edit distance application, for
> example a use case like the unix diff command, where you are doing
> edit-distance-like computations on two sequences of lines of text, and in
> fact the edit cost between any two lines may itself be computed as an edit
> distance. So I think designing the routines to be as general as possible
> (within reason) will pay dividends for the boost community, and improve
> their adoption.
>
> I may not want to focus on actual strings for Part A. similarity
functions, since here actual strings and
a container are similar to deal with. I also think it's easy for me to
compute the sequence for the actions.

Maybe the new interface can be like this:

Part A, similarity metrics(use edit distance as an example):
template....
(1) (size_type or unsigned) edit_distance(const Range1T&, const Range2T&);
(2) (size_type or unsigned) edit_distance(const Range1T&, const Range2T&,
(size_type or unsigned) threshold);
(3) (size_type or unsigned) edit_distance(const Range1T&, const Range2T&,
equal_func);
(4) (user-defined type) edit_distance(const Range1T&, const Range2T&, cost);

the cost struct may be like this:
template<T, V>
struct cost
{
   V substitution(const T&, const T&) const;
   V insertion(const T &a) const;
   V deletion(const T &a) const;
}

(5) pair<container of operations, edit_distance>
edit_distance_with_operation(const Range1T&, const Range2T&);
(6) pair<container of operations, edit_distance>
edit_distance_with_operation(const Range1T&, const Range2T&, threshold);
(7) pair<container of operations, edit_distance>
edit_distance_with_operation(const Range1T&, const Range2T&, equal_func);
(8) pair<container of operations, edit_distance>
edit_distance_with_operation(const Range1T&, const Range2T&, cost);
the "operation" may be a pair<"operator", "position">, where "operator" may
be an enum of insertion, deletion and substitution, "position" is an
unsigned value.

 _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk