Boost logo

Boost :

Subject: Re: [boost] [gsoc-2013] proposal for approximate string matching
From: Yu Jiang (sunlightt07_at_[hidden])
Date: 2013-04-27 01:00:45

2013/4/27 Erik Erlandson <eerlands_at_[hidden]>

> > Here is a prototype for the interface:
> > Part A. basic similarity metrics (use edit distance as an example)
> > (1) int edit_distance(const Range1T&, const Range2T&); // return edit
> > distance.
> > (2) int edit_distance(const Range1T&, const Range2T&, int threshold);
> > // return edit distance if it is no larger than threshold, else
> > return -1 or threshold + 1.
> > (3) int edit_distance(const Range1T&, const Range2T&, equal_func);
> > // use uqual_func to determine whether two elements are equal or not.
> For these first three, I'd typically expect them to return an unsigned
> integer. The ideal type might be the size_type of implied sequence
> containers, if it can be acquired, since the maximum possible distance is a
> function of the maximum supported sequence size.

Thanks, you're right. For edit distance, return size_type is really a
better choice. For some other
similarity metrics whose result is a real number in [0,1], I will choose
double as the return type.

> > (4) int edit_distance(const Range1T&, const Range2T&, insert_cost_func,
> > delete_cost_func, substitute_cost_func); // use three cost functions to
> > calculate the cost for one insert/delete/substitute operation.
> For this last one, I would want the return type for distance to be
> governed by the return type of the cost functionals. So, if
> insert_cost_func and friends return float, then the edit distance would
> also return float.
> Even more generally, I might want the final return type for all
> edit_distance variants to be selectable as a template parameter (and
> perhaps influence the internal types used to compute it). I'm open to
> ideas here.
> I agree to modify the return type for function (4). So now I think your
suggestion for using a class to supply
these 3 functions together is a better choice. In case (1)(2)(3), I think
return size_type seems to be better since
the algorithm itself can only output an integer as the result?

> (5) some functions to return the sequence of operations?
> I think that since you are focusing on supporting the search and join
> algorithms, distance variants that can return the sequence of edit-ops
> ought to be low priority. They could be added in the future, without
> impacting the library design.
> >
> > Part B. string similarity search and join algorithms, focus on edit
> distance
> > class searcher
> > {
> > void insert(IDType id, const string &word); // insert a word to the
> index,
> > id is used as an identifier of this word.
> > void erase(IDType id); // erase a word from the index, id should be the
> > same one used in insertion process.
> > vector<Result> search(const string &query, int threshold);
> > // performing similarity search, which will return all the strings which
> > are similar to the query string. Result may be a pair of the id and real
> > edit distance.
> > };
> This searcher class has the feel of some kind of container class, and
> maybe ought to be designed with an eye toward satisfying one of the
> standard container interfaces.

Thanks, I will consider it carefully. Currently I'm not sure about whether
I can support one of the standard container
interfaces. Probably I can. With the candidate strings are inserted and
removed, I should update the index for them.

I noticed you were considering only supporting one category of sequence
> distance function (edit_distance) here, but if you make the
> distance-function into a template parameter, you might be able to use it
> with all your planned distance functions "for free"

Good suggestion. But it seems to be hard. Maybe a high performance
algorithm can only support a specific
metric(I will not implement the brute-force method here). Since the edit
distance is the most useful one, I'd like to write an algorithm for it.

Thanks a lot!

Boost list run by bdawes at, gregod at, cpdaniel at, john at