Boost logo

Boost :

From: Marco (mrcekets_at_[hidden])
Date: 2007-05-18 07:07:12


I read the article pointed out by Phil Endecott:
An O(NP) Sequence Comparison Algorithm
where P=D/2 when M=N and D is the actual edit distance,
as said in the abstract:

Let A and B be two sequences of length M and N respectively, where without
loss of generality
N > M, and let D be the length of a shortest edit script between them. A
parameter related to D is
the number of deletions in such a script, P = D/2 − (N −M)/2. We present
an algorithm for
finding a shortest edit distance of A and B whose worst case running time
is O(NP) and whose
expected running time is O(N + PD). The algorithm is simple and is very
efficient whenever A is
similar to a subsequence of B. It is nearly twice as fast as the O(ND)
algorithm of Myers [9], and
much more efficient when A and B differ substantially in length.

the space required is O(M+N)

the pseudocode is enougth simple and it's almost straightforward to
translate it in C++

Algorithm Compare
Begin
   fp[−(M+1)..(N+1)] := −1;
   p := −1;
   Repeat
     Begin
       p := p + 1;
       For k := −p to D−1 do
         fp[k] := snake( k, max (fp[k−1] + 1, fp[k+1]) );
       For k := D + p downto D + 1 by −1 do
         fp[k] := snake( k, max (fp[k−1] + 1, fp[k+1]) );
      fp[D] := snake( D, max (fp[D−1] + 1, fp[D+1]) );
     End
   Until fp[D] = N;
   Write "The edit distance is: " D+2p;
End

Function snake( k, y: int) : int
Begin
   x := y − k;
   While x < M and y < N and A[x+1] = B[y+1] do
   Begin
     x := x + 1; y := y + 1;
   End
   snake := y;
End

FWIW, the article provides same performance comparison
against the Miller's algorithm (O(ND)), the one implemented by Phil I
guess.

              number of edit number of comparisons execution time
  M N deletions distance O(NP) O(ND) O(NP) O(ND)
4000 5000 10 1020 21564 526506 0.13 2.67
4000 5000 50 1100 59520 614391 0.41 3.12
4000 5000 100 1200 121635 737748 0.83 4.02
4000 5000 200 1400 255157 1004952 1.62 5.87
4000 5000 400 1800 600216 1693377 3.68 8.94
4000 5000 600 2200 1016433 2523687 6.41 14.85
5000 5000 200 400 49202 93139 0.33 0.61
5000 5000 600 1200 398499 791815 2.34 4.65

In the attached file there is a draft but working implementation.

this is a link to the article:
http://www.cs.arizona.edu/~gene/PAPERS/np_diff.ps

in reading it pay attention that the x-y coordinates are swapped and this
is a bit annoying.

Best Regards,
Marco

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/



Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk