Boost logo

Boost :

Subject: Re: [boost] RFC: edit_distance / edit_alignment library
From: Erik Erlandson (eje_at_[hidden])
Date: 2013-07-24 11:23:22


----- Original Message -----
> > I am grappling with how best to represent the returned "edit
> > script".
>
> "Best" obviously depends on what the caller is going to do with it.
> A good design for the interface should emerge once some experience
> has been gained with using the algorithm in actual applications.
> Attempting to design a general-purpose interface too soon can be
> a mistake.

>
> Having said that, the most generic way to return the edit script
> would be to template the algorithm on an output-handling class:
>
> template <typename ITER1, typename ITER2, typename OUTPUT>
> void diff(ITER1 begin1, ITER1 end1, ITER2 begin2, ITER2 end2, OUTPUT& output)

I had a design using this approach, which has obvious advantages. The one thing I wasn't crazy about is that to do this, the implementation layer is committed to always computing and/or storing start+length information and edit-op cost information, so that the output handler has its space of choices about which information it cares about. My motivation for the draft I presented is that it leverages some template specializations so that it computes exactly the information requested via the adaptors. For example, if you don't ask for edit cost information, it invokes version of the implementation that doesn't spend the effort to unpack that information from the working matrix.

On the other hand, quite possibly the Dijkstra and Meyers algorithms don't require that kind of extra work, as they don't operate on that kind of working matrix.

I agree strongly with your point about leveraging application experience to make good general-application choices. In fact, the last time I worked with edit distances or dynamic programming was nearly 20 years ago, and it was not on very long sequence data such as base-pairs or large-file diffs. Still, the universe of outputs from these algorithms isn't big: They return a sequence of edit op-codes. Optionally, they can include location information (start/end) and operation costs. It seems possible to deduce a small set of output facilities that provide good coverage. Not that I don't prefer induction to deduction :-)


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk