|
Boost : |
Subject: Re: [boost] RFC: edit_distance / edit_alignment library
From: Erik Erlandson (eje_at_[hidden])
Date: 2013-07-23 12:35:49
----- Original Message -----
> Hi Erik,
>
> Erik Erlandson wrote:
> > Boost Community,
> >
> > I cooked up a library that provides a pair of functions: edit_distance()
> > and edit_alignment(). The library currently lives here:
> > https://github.com/erikerlandson/edit_distance
>
> I have had a quick look. It seems that you've implemented the
> Needleman-Wunsch
> algorithm, which takes quadratic time and space. There are less
> complex algorithms,
> at least for similar input sequences, e.g. Myers - "An O(ND) Difference
> Algorithm
> and Its Variations"; I have an implementation of that here:
> http://svn.chezphil.org/anyterm/trunk/src/diff.cc & .hh.
Than you for the Myers paper! I've been collecting some thoughts on this topic.
I would certainly want Myers' algorithm (or something like it) in a Boost sequence alignment library for its performance on large sequence data (previously I had been thinking in terms of Hirschberg, which is less efficient). I do also wish to provide an algorithm that supports substitution as an edit operation, and configurable cost functions whose output may vary with the sequence elements. Replacing Needleman-Wunsch with a Dijkstra/SSSP implementation under the hood seems like a good idea.
I am grappling with how best to represent the returned "edit script". Considering the Myers paper, some kind of run-compression seems useful. The information required depends on what operation. To represent a run of 'equal elements', you need begin/length for sequence 1 and sequence 2. Substitution is similar. For insertion and deletion, you only need begin/length for one sequence. One might use boost::any to address the differing information, although it seems to involve overhead. Not sure if there is a clear boost-style answer that stands out.
In my draft, I also experimented with allowing selection of varying edit-script information via function adaptors. The useful options there differ between Myers, where edit costs are constant, and my desired variant that allows non-constant edit costs.
The minimal edit script output in any case is a pair (op-code, run-length). If one wants additional location information, then the op-dependent variation appears, adding begin-index for either one or two sequences. In the case of Myers, edit cost information is not relevant. I see no real problem with supporting different options for different algorithms, as long as the interface is as consistent as possible. I still like this as a feature, although with run compression one might argue that the benefits decrease.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk