Boost logo

Boost :

Subject: Re: [boost] [algorithm] Review Request: edit_distance
From: Erik Erlandson (eje_at_[hidden])
Date: 2014-01-28 16:01:26


----- Original Message -----
> Hi Erik,
>
> Erik Erlandson <eje_at_[hidden]> wrote:
> > I would like to request a formal review for inclusion
> > of the edit_distance() function.
>
> Congratulations on your progress.
>
> I have tried your code and done a quick benchmark comparison
> with my own version of Myers' algorithm. I'm pleased to see
> that you have excellent performance.

Thanks for looking at it, and benchmarking too, I really appreciate it!

>
> Some comments:
>
> - I think it should be possible to relax the requirement for
> random access iterators. Either continue to store integers
> and then use std::advance, or store iterators. I'm considering,
> for example, comparing UTF-8 with an iterator adaptor than
> changes a random-access iterator over the underlying bytes
> into a bidirectional iterator over the characters.

I briefly toyed with this idea earlier on, but it seemed (imo) like it would undermine the virtues of the algorithm. I guess what I mean by that is, Myers does stuff like storing only one index in the working vector 'V' for memory efficiency, and so we have "j2 = j1-k", which either requires index manipulation, or if you wanted to do it via iterators, it is only efficient for random-access iterators.

Similar issues apply to other checks, e.g. "if ((j1-j2) == (r1-r2) && j1 >= r1) return 2*D-1;"

Part of the genius of the algorithm is how baked-down it is, and trying to support non-random-access iterators would either turn certain operations from constant-time to linear time, or require carrying around indexes as well as iterators, at substantial additional memory cost. For that, I have the SSSP specialization.

However, I might be mis-reading the trade-offs. I'll keep thinking about it.

>
> - Have you looked at the supposedly better algorithm "An O(NP)
> Sequence Comparison Algorithm" by Sun Wu, Udi Manber, and Gene
> Myers ?

I have not, I'll read up on it.

>
> struct output {
>
> template <typename ITER_T>
> void insertion(ITER_T begin, ITER_T end) {
> do_something_with(begin,end);
> }
>
> };

One issue with the above is that the actual operation cost can be different for each element. So I can't get away with just passing begin/end iterators. I'd have to also provide some kind of iteration over the corresponding edit costs.

Unless one didn't care about the cost information. How to best handle script output has been thorny for me. Not every user is going to care about all the same information. My philosophy has been that I will provide it, and it can be ignored. Using in-line methods, ignoring it can even allow the compiler to avoid generating code around it, I think.

Another implication is that it requires doing what you did above -- writing handler methods templated to account for unknown/opaque iterator types, which seems less easy to think about than describing the handler methods as simply taking element values. I would also note that substitution() and equality() involve two iterator types, two begin/end pairs, etc.

However, the potential scaling benefits on large sequences might make some added complexity worthwhile.

>
>
> - Some might prefer a version that doesn't use Boost.Parameter.

I'm not sure how that would look: would it just take all 8 parameters all the time?

>
> - In the same vein, you might offer a version that takes an
> iterator pair rather than a range.

I agree -- I originally put off figuring out the examples that involved multiple function specializations combined with Boost.Parameter, but it could be time to revisit. Given that both range arguments and hypothetical iterator arguments are templates, I wasn't even sure if I could get it to properly differentiate.

Tangentially, I'm not totally clear if the 'range' view of the universe is The Future, and doing function interfaces taking iterator pairs is now deprecated, or if the idea is to support both schemes in parallel.

>
> - My experience has been that early termination is best expressed
> in terms of RAM, i.e. the caller could indicate a maximum amount
> of RAM, in bytes, rather than a maximum for the edit cost. You
> can probably convert between them via sizeof.

Agreed -- tangentially, I filed an issue for myself using maximum time:
https://github.com/erikerlandson/edit_distance/issues/6

>
> - I'm not convinced that the /sequence/ in the include path is
> useful. Why not just put it in boost/algorithm/ ?

I'm pretty open to ideas about where it should live. My logic is that there are other worthy proposals for sequence-related algorithms, and so it might be good to open up a new bucket that might include things besides edit_distance(). For example, last year a couple different proposals of this nature were submitted as GSoC candidates, like this one:
http://lists.boost.org/Archives/boost/2013/04/202971.php

>
> - It would be great to see a benchmark against GNU diff!

Also a good idea, although my impression of that code (which I poked around while doing this) is that it isn't factored in a way that would make dropping in a replacement very practical. I suppose the other option is to write a version of diff using edit_distance, which in fact is more or less the kind of project that got me to wanting a generic edit_distance library in the first place.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk