Boost logo

Boost :

Subject: Re: [boost] [algorithm] Review Request: edit_distance
From: Erik Erlandson (eje_at_[hidden])
Date: 2014-02-03 17:36:50


----- Original Message -----

> >
> > I've been mulling this over, and I have an idea that is starting
> > to grow on me.
> >
> > What if we left insertion() and deletion() (and substitution) as-is:
> > they continue to receive individual sequence elements, and operation
> > scores. But we *do* modify the equality() method to take two ranges,
> > in the manner you suggested above. It would introduce an asymmetry
> > of equality() with respect to the other methods, but I think there
> > are some real arguments for it:
>
> [snip]
>
> > *) Runs of 'equal' dominate the script in most larger-scale applications
> > of interest. Actual insertions, deletions, etc, are relatively small.
>
> I can't agree with that; for me, a common case is where the two inputs
> are entirely different.

Something of a tangent, but I'm curious how that works for you. When two strings are entirely different, the algorithm cost is O(NxM). On larger data, say a million elements, that's terabytes of working data, and tera-ops.

>
> Here's another possibility: don't pass ranges, but do pass iterators:
>
> struct output {
>
> ITER pending;
> enum ... last_action;
> void insertion(ITER i, int) {
> if (last_action != insert) {
> do_something_with(pending,i);
> pending = i;
> last_action = insert;
> }
> }
> // etc.
>
> };
>
>
> The point here is that I just need to store the iterator for the start
> of the range until I get a different action; in contrast, if the value
> is passed I need to accumulate a copy of the range.

That seems promising. It makes it straightforward to handle the script data either as it appears, or save up an iterator range. I could use a similar variation on the 'diff' example, and not bother to copy the file data, just refer back to the original for output.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk