Subject: Re: [boost] RFC: edit_distance / edit_alignment library
From: Erik Erlandson (eje_at_[hidden])
Date: 2013-11-07 10:17:27
I reworked both the algorithms and the interface.
I devised a variation on the Dijkstra Single Source Shortest Path algorithm, optimized for the particular structure of an edit graph. Like the Meyers algorithm, it is very efficient on long sequences with localized differences (e.g. computing the diff of two similar files, or comparing two similar DNA sequences). Under such conditions, I've been comparing two similar sequences with 100 million elements in under a second.
I went with an SSSP-based approach because the Meyers algorithm makes assumptions about the edit operation cost function, and a general-purpose library should allow any cost function (think std::sort() or std::map<> allowing any definition of '<'). I don't feel like I sacrificed a lot of performance. It's quite fast. I doubt it would be hard to allow the user to provide a specialized 'meyers_cost' object, and have it trigger a specific Meyers algorithm under the hood.
I also upgraded the interface to a named-argument style function, using the boost::parameter library, so calls now support some named optional parameters:
// apply an optional custom cost, and an optional beam constraint
int d = edit_distance(seq1, seq2, _cost = my_cost(), _beam=100);
In the case of edit_alignment(), the actual sequence of edit operations is received by a user-supplied output handling object:
// output_handler provides methods for handling insertion, deletion, substitution and equal elements
int d = edit_alignment(seq1, seq2, output_handler, _cost=my_cost(), _beam=1000);