From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2007-05-15 12:31:35
JD wrote:
>> I have an implementation of the "diff" (i.e. edit script) algorithm here:
>>
>> http://svn.anyterm.org/anyterm/trunk/common/diff.hh
>> http://svn.anyterm.org/anyterm/trunk/common/diff.cc
>>
>> The "edit distance" is in there somewhere.
>>
>> I would welcome this sort of thing in Boost. But it needs to be the
>> best-known algorithm. IIRC, there was one improvement that I found in
>> the literature but didn't implement because I didn't fully understand it.
>>
>> Please correct me if I'm wrong, but I get the impression that your code
>> is O(NM) where N and M are the sizes of the two input strings. The
>> algorithm that I implemented is O((N+M)D) in the worst case, where D is
>> the edit distance, and typically O(N+M+D^2). The algorithm is
>> described in the .cc file linked above. Also, your code seems to have
>> an O(NM) space requirement; I'm not sure what the lower bound is, but I
>> think it's probably O(N+M).
>
> I improved the algorithm and now the space requirement falls to O(m).
> The computational cost is still the same, however. Having had a very
> quick look at your algorithm (in diff.cc), I can see it's more complex,
> to the point that it is not understandable at first read.
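(Just to check that we mean the same thing: I take it the O(m)-space
version is the usual rolling-row dynamic-programming formulation, along
the lines of the sketch below. The names are mine, not from your code;
it computes an insert/delete-only edit distance, keeping only one row of
the table so the space is O(m) while the time stays O(NM).)

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

std::size_t dp_edit_distance(const std::string& a, const std::string& b)
{
    // row[j] holds D[i][j], the distance between a[0..i) and b[0..j),
    // for the row currently being filled in.
    std::vector<std::size_t> row(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        row[j] = j;                               // "" -> b[0..j): j insertions
    for (std::size_t i = 1; i <= a.size(); ++i) {
        std::size_t diag = row[0];                // D[i-1][j-1]
        row[0] = i;                               // a[0..i) -> "": i deletions
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t up = row[j];              // D[i-1][j]
            if (a[i - 1] == b[j - 1])
                row[j] = diag;                    // match: no edit needed
            else
                row[j] = std::min(up, row[j - 1]) + 1;  // delete or insert
            diag = up;
        }
    }
    return row[b.size()];
}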
Indeed it's quite complex, but in the comments I refer to a couple of
papers which I suggest you read. The operation is much easier to
understand if you can visualise it, and my attempts at ASCII-art
diagrams are not as good as the ones in those papers.
There is probably other material in the literature; does Knuth have a
chapter on this subject?
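To give a flavour of it without the diagrams: the distance-only core of
the approach is the greedy "furthest-reaching path" loop of Myers' O(ND)
scheme, roughly as sketched below. This is a much-simplified sketch with
my own names, not the code in diff.cc; it only returns D, and recovering
the actual edit script within O(N+M) space is where the real complexity
lives.

#include <cstddef>
#include <string>
#include <vector>

int greedy_edit_distance(const std::string& a, const std::string& b)
{
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    const int max_d = n + m;
    const int offset = max_d + 1;
    // v[k + offset] is the furthest x reached on diagonal k, where k = x - y.
    std::vector<int> v(2 * max_d + 3, 0);
    for (int d = 0; d <= max_d; ++d) {            // try successively larger D
        for (int k = -d; k <= d; k += 2) {
            int x;
            if (k == -d || (k != d && v[k - 1 + offset] < v[k + 1 + offset]))
                x = v[k + 1 + offset];            // step down: insert b[y]
            else
                x = v[k - 1 + offset] + 1;        // step right: delete a[x]
            int y = x - k;
            while (x < n && y < m && a[x] == b[y]) {
                ++x; ++y;                         // free diagonal moves ("snake")
            }
            v[k + offset] = x;
            if (x >= n && y >= m)
                return d;                         // first D that reaches the end
        }
    }
    return max_d;                                 // not reached
}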
> As I can see in the Boost Guideline
> (http://www.boost.org/more/lib_guide.htm#Guidelines):
>
> "Aim first for clarity and correctness; optimization should be only a
> secondary concern in most Boost libraries."
>
> So unless there is an obvious way to reduce the computational cost to
> O(m) (the same way it was obvious to reduce the space usage to O(m)),
> I think we should stick to the original algorithm.
I would disagree with that view. Let's quantify just how much the
performance of the algorithms differs. My application was to compare
the contents of a 25x80-character terminal window before and after the
user has changed something; typically they have just typed one or two
characters. So N = M = 25x80 = 2000, and D is typically 2. Your O(NM)
algorithm takes on the order of NM = 4 000 000 steps, while mine
typically takes on the order of N+M+D^2 = 4004 steps, i.e. about 1000
times faster, and in the worst case (N+M)D = 8000 steps, i.e. about
500 times faster.
I am not sure what the author of the guideline you have quoted had in
mind, but I would have thought that they were not thinking of big-O
complexity differences of this size. My personal approach to
optimisation has always been to make sure that the big-O complexity is
right; it is then rarely necessary to consider optimising anything else.
Regards,
Phil.