 # Boost :

From: Stjepan Rajko (stipe_at_[hidden])
Date: 2007-05-12 17:17:56

OK, I was a little off about the maximum common subsequence thing -
getting the length of the maximum common subsequence would require:
- initialization of top row and left column of d to 0
- instead of min you use max (so instead of "costs" things become
"scores") - this is another useful thing to let the user specify
- "insertion" and "deletion" scores (moving horizontally or
vertically through table) become 0
- comparison score function (moving diagonally through table) is 1
for equal characters and 0 for unequal

I _think_ that should be right... but I'm still talking off the top of
my head :-) sorry for any mistakes.

S

On 5/12/07, Stjepan Rajko <stipe_at_[hidden]> wrote:
> > Wow that's just the algorithm found on wikipedia, re-written in c++ and
> > boostified for introduction in boost/algorithm/string. Nothing fancy
> > here, it is definitely basic dynamic programming. My question was more
> > to know if there is potential "users" interested and if it's the
> > appropiate place to propose it.
> >
>
> I know... I was just pointing out that by implementing edit distance,
> you're not far from a slew of other useful things. For example, can
> easily add support for a ton of related problems (maximum common
> subsequence distance, for example), just by allowing the user to
> specify how the top row and leftmost column of the dynamic programming
> matrix are initialized. In your current implementation, you
> initialize d[i]=d[i]=i: if instead you set d[i]=d[i]=0 (if
> I remember correctly), you get maximum common subsequence distance.
> Specifying the cost of insertion / deletion is another useful thing to
> allow (hard coded to 1 in your case right now), and providing a custom
> comparison cost function between characters is another (right now it
> is hard-coded to a cost of 1 if they are different and 0 if they are
> equal).
>
> The next layer of difficulty/usefulness is allowing affine
> insertion/deletion costs (e.g., an insertion of length N is given a
> cost of aN + b), but that requires two tables. And then, you're not
> far from a decent probabilistic pattern recognition mechanism :-)
>
> This is all especially useful if you also add traceback extraction,
> i.e. extracting the calculation path which yielded the optimal result.
> This gives you the string alignment which yields optimal edit
> distance, or yields the maximum common subsequence in the other
> initialization case.
>
> I'm not saying you should go ahead and implement all this, but there
> is definitelly a part of it that requires a small amount of effort but
> makes it a ton more useful - I'd say at least allowing the user to
> specify custom insertion, deletion, and comparison function costs, and
> extracting the optimal alignment in the end.
>
> Stjepan
>