Boost logo

Boost :

Subject: Re: [boost] [gsoc-2013] proposal for approximate string matching
From: Erik Erlandson (eerlands_at_[hidden])
Date: 2013-04-26 18:44:10

> Here is a prototype for the interface:
> Part A. basic similarity metrics (use edit distance as an example)
> (1) int edit_distance(const Range1T&, const Range2T&); // return edit
> distance.
> (2) int edit_distance(const Range1T&, const Range2T&, int threshold);
> // return edit distance if it is no larger than threshold, else
> return -1 or threshold + 1.
> (3) int edit_distance(const Range1T&, const Range2T&, equal_func);
> // use uqual_func to determine whether two elements are equal or not.

For these first three, I'd typically expect them to return an unsigned integer. The ideal type might be the size_type of implied sequence containers, if it can be acquired, since the maximum possible distance is a function of the maximum supported sequence size.

> (4) int edit_distance(const Range1T&, const Range2T&, insert_cost_func,
> delete_cost_func, substitute_cost_func); // use three cost functions to
> calculate the cost for one insert/delete/substitute operation.

For this last one, I would want the return type for distance to be governed by the return type of the cost functionals. So, if insert_cost_func and friends return float, then the edit distance would also return float.

Even more generally, I might want the final return type for all edit_distance variants to be selectable as a template parameter (and perhaps influence the internal types used to compute it). I'm open to ideas here.

> (5) some functions to return the sequence of operations?

I think that since you are focusing on supporting the search and join algorithms, distance variants that can return the sequence of edit-ops ought to be low priority. They could be added in the future, without impacting the library design.

> Part B. string similarity search and join algorithms, focus on edit distance
> class searcher
> {
> void insert(IDType id, const string &word); // insert a word to the index,
> id is used as an identifier of this word.
> void erase(IDType id); // erase a word from the index, id should be the
> same one used in insertion process.
> vector<Result> search(const string &query, int threshold);
> // performing similarity search, which will return all the strings which
> are similar to the query string. Result may be a pair of the id and real
> edit distance.
> };

This searcher class has the feel of some kind of container class, and maybe ought to be designed with an eye toward satisfying one of the standard container interfaces.

I noticed you were considering only supporting one category of sequence distance function (edit_distance) here, but if you make the distance-function into a template parameter, you might be able to use it with all your planned distance functions "for free"

> class joiner
> {
> vector<PairResult> join(const string_and_id_container &R,
> const string_and_id_container &S, int threshold);
> // performing similarity join. Usually we are doing on-line join, which
> means
> we cannot pre-build the index. So we only need this one function. It will
> return all the <r,s> pair where r is in R and s is in S and r is similar to
> s.
> PairResult may be a tuple of r's id, s's id and real edit distance.
> };
> (4) Proposed Milestones and Schedule
> Now – June 15
> To be more familiar with the Boost library. Discuss the details and
> interfaces with my mentor and on the develop mail list.
> June 15 – July 15
> Develop a beta version of the approximate matching library. Most functions
> will be done at this time.
> July 15 – August 1
> Review my code and improve its performance and quality. Discuss with my
> mentor for mid-term evaluation and future works.
> August 1 – September
> Write documentation, test code for my library. Make it "boost" like.
> _______________________________________________
> Unsubscribe & other changes:

Boost list run by bdawes at, gregod at, cpdaniel at, john at