Boost logo

Boost :

Subject: Re: [boost] [gsoc-2013] proposal for approximate string matching
From: Yu Jiang (sunlightt07_at_[hidden])
Date: 2013-04-27 09:52:54


2013/4/27 Vadim Stadnik <vadimstdk_at_[hidden]>

> On Sat, Apr 27, 2013 at 3:00 PM, Yu Jiang <sunlightt07_at_[hidden]> wrote:
>
> > 2013/4/27 Erik Erlandson <eerlands_at_[hidden]>
> >
> > >
> > > > Here is a prototype for the interface:
> > > > Part A. basic similarity metrics (use edit distance as an example)
> > > > (1) int edit_distance(const Range1T&, const Range2T&); // return edit
> > > > distance.
> > > > (2) int edit_distance(const Range1T&, const Range2T&, int threshold);
> > > > // return edit distance if it is no larger than threshold, else
> > > > return -1 or threshold + 1.
> > > > (3) int edit_distance(const Range1T&, const Range2T&, equal_func);
> > > > // use uqual_func to determine whether two elements are equal or not.
> > >
> > > For these first three, I'd typically expect them to return an unsigned
> > > integer. The ideal type might be the size_type of implied sequence
> > > containers, if it can be acquired, since the maximum possible distance
> > is a
> > > function of the maximum supported sequence size.
> >
> >
> > Thanks, you're right. For edit distance, return size_type is really a
> > better choice. For some other
> > similarity metrics whose result is a real number in [0,1], I will choose
> > double as the return type.
> >
> >
> For those who use STL distance is signed integer. If you choose unsigned
> integer, then I suggest reconsider function names to avoid potential
> confusion for users. For example, is it "deviation" that you name
> "distance"?

Thanks! I would prefer not to change the function name since in such area
"edit distance" is used everywhere so user may be familiar with this word.
IMO, STL distance returns signed integer because the order of the two
parameters matters. For this edit distance function, usually the order of
the two strings may not affect the result unless in (4). So I may prefer
to return a unsigned integer.

>
> > > Part B. string similarity search and join algorithms, focus on edit
> > > distance
> > > > class searcher
> > > > {
> > > > void insert(IDType id, const string &word); // insert a word to the
> > > index,
> > > > id is used as an identifier of this word.
> > > > void erase(IDType id); // erase a word from the index, id should be
> > the
> > > > same one used in insertion process.
> > > > vector<Result> search(const string &query, int threshold);
> > > > // performing similarity search, which will return all the strings
> > which
> > > > are similar to the query string. Result may be a pair of the id and
> > real
> > > > edit distance.
> > > > };
> > >
> > >
> > > This searcher class has the feel of some kind of container class, and
> > > maybe ought to be designed with an eye toward satisfying one of the
> > > standard container interfaces.
> >
> >
> > Thanks, I will consider it carefully. Currently I'm not sure about
> whether
> > I can support one of the standard container
> > interfaces. Probably I can. With the candidate strings are inserted and
> > removed, I should update the index for them.
> >
> >
> IMO, container should be parameterized to allow users choosing the best
> performing container. Strings based on augmented data structures offer a
> wide set of very efficient search and update operations, including
> logarithmic time splice and split. They can give significant performance
> improvement against strings based on basic data structures, such as
> std::string, std::vector<char>, std::list<char> etc.

Sorry for this. It seems I didn't describe my idea clearly. I'd like to do
something
like this: S is a large container of strings and when the user performing
searching,
we can return all the strings in S which are similar to the given
string(maybe the
edit distance between them is less than 2). So here I need to build a index
first.
In fact I want to supply some functions to manipulate the index, since the
user
may want to change the strings in S slightly. So maybe I'll make the
"searcher"
class look like a container, but there isn't a real container in it.

Thanks.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk