|
Boost : |
Subject: Re: [boost] [gsoc-2013] proposal for approximate string matching
From: Vadim Stadnik (vadimstdk_at_[hidden])
Date: 2013-04-27 02:28:17
On Sat, Apr 27, 2013 at 3:00 PM, Yu Jiang <sunlightt07_at_[hidden]> wrote:
> 2013/4/27 Erik Erlandson <eerlands_at_[hidden]>
>
> >
> > > Here is a prototype for the interface:
> > > Part A. basic similarity metrics (use edit distance as an example)
> > > (1) int edit_distance(const Range1T&, const Range2T&); // return edit
> > > distance.
> > > (2) int edit_distance(const Range1T&, const Range2T&, int threshold);
> > > // return edit distance if it is no larger than threshold, else
> > > return -1 or threshold + 1.
> > > (3) int edit_distance(const Range1T&, const Range2T&, equal_func);
> > > // use uqual_func to determine whether two elements are equal or not.
> >
> > For these first three, I'd typically expect them to return an unsigned
> > integer. The ideal type might be the size_type of implied sequence
> > containers, if it can be acquired, since the maximum possible distance
> is a
> > function of the maximum supported sequence size.
>
>
> Thanks, you're right. For edit distance, return size_type is really a
> better choice. For some other
> similarity metrics whose result is a real number in [0,1], I will choose
> double as the return type.
>
>
For those who use STL distance is signed integer. If you choose unsigned
integer, then I suggest reconsider function names to avoid potential
confusion for users. For example, is it "deviation" that you name
"distance"?
> > Part B. string similarity search and join algorithms, focus on edit
> > distance
> > > class searcher
> > > {
> > > void insert(IDType id, const string &word); // insert a word to the
> > index,
> > > id is used as an identifier of this word.
> > > void erase(IDType id); // erase a word from the index, id should be
> the
> > > same one used in insertion process.
> > > vector<Result> search(const string &query, int threshold);
> > > // performing similarity search, which will return all the strings
> which
> > > are similar to the query string. Result may be a pair of the id and
> real
> > > edit distance.
> > > };
> >
> >
> > This searcher class has the feel of some kind of container class, and
> > maybe ought to be designed with an eye toward satisfying one of the
> > standard container interfaces.
>
>
> Thanks, I will consider it carefully. Currently I'm not sure about whether
> I can support one of the standard container
> interfaces. Probably I can. With the candidate strings are inserted and
> removed, I should update the index for them.
>
>
IMO, container should be parameterized to allow users choosing the best
performing container. Strings based on augmented data structures offer a
wide set of very efficient search and update operations, including
logarithmic time splice and split. They can give significant performance
improvement against strings based on basic data structures, such as
std::string, std::vector<char>, std::list<char> etc.
Regards,
Vadim Stadnik
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk