Boost logo

Boost :

Subject: Re: [boost] [algorithm] Review Request: edit_distance
From: Erik Erlandson (eje_at_[hidden])
Date: 2014-02-02 16:28:25


----- Original Message -----

> - It may be useful to group contiguous operations of the same
> kind, i.e. a "run" of N equal characters would result in one
> call to the output script object, rather than N calls. This
> can be achieved by merging in the script object, but that
> would be less efficient than passing the entire run as an
> iterator pair:
>
> struct output {
>
> template <typename ITER_T>
> void insertion(ITER_T begin, ITER_T end) {
> do_something_with(begin,end);
> }
>
> };
>

I've been mulling this over, and I have an idea that is starting to grow on me.

What if we left insertion() and deletion() (and substitution) as-is: they continue to receive individual sequence elements, and operation scores. But we *do* modify the equality() method to take two ranges, in the manner you suggested above. It would introduce an asymmetry of equality() with respect to the other methods, but I think there are some real arguments for it:

*) the order of insertions and deletions identified by these algorithms is not generally "consecutive" - that is, if the edit script involves a sequence of deletions and insertions, you cannot count on these being presented consecutively, and that will tend to undermine the benefit of accepting such ranges in the handling methods.
*) the algorithms *do* identify consecutive runs of equal elements.

*) the insertion() and deletion() methods receive operation-cost information, and it isn't trivial to fold that information into an iterator stream.
*) the equality() method does *not* receive operation cost information, because that cost is always zero by definition. That means it is sensible (and easy) to simply present a range constructed of the iterators at the start and end of the equal run in question. This begin/end information for equal runs is readily available, and saves cost both internal to the algorithms, and (likely) externally as well.

*) Runs of 'equal' dominate the script in most larger-scale applications of interest. Actual insertions, deletions, etc, are relatively small. Compressing equal runs in this way is where almost all of the bang/buck resides.

*) This is more speculative, but just looking at how I ended up doing my 'diff' benchmark implementation, accepting ranges for insertion() and deletion() would actually have required me to do a bit *more* coding, to handle the logic of appending an incoming range of data onto my 'ins' and 'del' vectors, compared to simply push_back() as I did:
https://github.com/erikerlandson/algorithm/blob/edit_distance/sequence/example/edit_distance_diff_example.cpp#L69


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk