Subject: Re: [boost] [gsoc-2013] proposal for approximate string matching
From: Erik Erlandson (eerlands_at_[hidden])
Date: 2013-04-26 18:44:10
> Here is a prototype for the interface:
> Part A. basic similarity metrics (use edit distance as an example)
> (1) int edit_distance(const Range1T&, const Range2T&); // return edit
> (2) int edit_distance(const Range1T&, const Range2T&, int threshold);
> // return edit distance if it is no larger than threshold, else
> return -1 or threshold + 1.
> (3) int edit_distance(const Range1T&, const Range2T&, equal_func);
> // use uqual_func to determine whether two elements are equal or not.
For these first three, I'd typically expect them to return an unsigned integer. The ideal type might be the size_type of implied sequence containers, if it can be acquired, since the maximum possible distance is a function of the maximum supported sequence size.
> (4) int edit_distance(const Range1T&, const Range2T&, insert_cost_func,
> delete_cost_func, substitute_cost_func); // use three cost functions to
> calculate the cost for one insert/delete/substitute operation.
For this last one, I would want the return type for distance to be governed by the return type of the cost functionals. So, if insert_cost_func and friends return float, then the edit distance would also return float.
Even more generally, I might want the final return type for all edit_distance variants to be selectable as a template parameter (and perhaps influence the internal types used to compute it). I'm open to ideas here.
> (5) some functions to return the sequence of operations?
I think that since you are focusing on supporting the search and join algorithms, distance variants that can return the sequence of edit-ops ought to be low priority. They could be added in the future, without impacting the library design.
> Part B. string similarity search and join algorithms, focus on edit distance
> class searcher
> void insert(IDType id, const string &word); // insert a word to the index,
> id is used as an identifier of this word.
> void erase(IDType id); // erase a word from the index, id should be the
> same one used in insertion process.
> vector<Result> search(const string &query, int threshold);
> // performing similarity search, which will return all the strings which
> are similar to the query string. Result may be a pair of the id and real
> edit distance.
This searcher class has the feel of some kind of container class, and maybe ought to be designed with an eye toward satisfying one of the standard container interfaces.
I noticed you were considering only supporting one category of sequence distance function (edit_distance) here, but if you make the distance-function into a template parameter, you might be able to use it with all your planned distance functions "for free"
> class joiner
> vector<PairResult> join(const string_and_id_container &R,
> const string_and_id_container &S, int threshold);
> // performing similarity join. Usually we are doing on-line join, which
> we cannot pre-build the index. So we only need this one function. It will
> return all the <r,s> pair where r is in R and s is in S and r is similar to
> PairResult may be a tuple of r's id, s's id and real edit distance.
> (4) Proposed Milestones and Schedule
> Now â June 15
> To be more familiar with the Boost library. Discuss the details and
> interfaces with my mentor and on the develop mail list.
> June 15 â July 15
> Develop a beta version of the approximate matching library. Most functions
> will be done at this time.
> July 15 â August 1
> Review my code and improve its performance and quality. Discuss with my
> mentor for mid-term evaluation and future works.
> August 1 â September
> Write documentation, test code for my library. Make it "boost" like.
> Unsubscribe & other changes:
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk