Boost logo

Boost Users :

From: Brian Allison (brian_at_[hidden])
Date: 2006-08-29 10:31:28


david v wrote:

>Yes i think there were some misunderstanding here.. I think that comes by
>the definition you have of mistake. A mistake for me is as follows:
>
>Regex: "testing"
>String_to_search: "tastung".
>The output should be that the regex testing was found but with 2 mismatches
>that are "a" and "u". So a mismatch is a letter that was not found.
>
>
>It may sound weird to you but the way i'm using the regex is to identify
>genomic regions, so in other words for biological applications.
>In some cases my regex is a piece of DNA such as "atgcta" and i want to
>search for this regex in another piece of DNA. Given the fact that the regex
>"atgcta" can be found in the genome many times i will get probably get a lot
>of matches. But in some cases i want to be able to search for "atgcta" but i
>want to allow some mismatches. Obviuously i will even get more matches but i
>think regex can be a more much efficient way that by building ip aligment
>matrices.
>
>ANy idea how to handle the example above
>

David,

  The problem you describe is fundamentally different than that of
Regular Expressions.

  A regex pattern "can't count", and is limited to anything that can be
found with a 1-symbol language. (Sorry, obligatory obscure reference to
theory.)

  In plain English, a regex can only list the possible things it matches
against. It's sort of like having a long list of possibilites, even an
infinite list, but a list nonetheless. It's critical to understand that
no state can be held. In other words, while we can form a regex to find
a*b*, we can't use a regex to find out if the pattern matches (a^n)(b^n)
where n in any arbitrary value that we don't specify. A regular
expression would have to be able to store the number of found 'a' in
some state and then compare the number of found 'b' against it.
  Regular expressions are not capable of storing state, so trying to
find the answer to your question is in a region of research called
"approximate string matching".

Definitions:
  Edit distance: the number of single character changes to change one
string S into another string R.

  To be a little more precise, the two most popular edit distance
definitions are:
    * Levenshtein edit distance, which allows for the single-character
insertion, deletion, or substitution operations.
    * Damerau edit distance, which also allows for the transposition of
adjoining characters as a single operation.

  Given that the operations are all treated as having cost 1, then the
"cost" of changing S to R is the "edit distance" between the strings.

  Your problem is not a regular expression problem, but you seem to want
to search for all strings {R} which are within an edit distance of N.
(Most people choose 2, and use an algorithm that blows up badly for N>3.)

  Depending on the constraints in your actual problem, you are left with
various approaches. Let me ask some questions:

  1) What's the longest string you're searching for?
  2) What's the smallest bit-width of any cpu on which you're running
your code?
  3) How familiar with string searching/matching are you?

  After seeing the answers to your questions, I may be able to point you
to one or more papers which can help you out. If you want to just start
reading, you can google the names Navarro, Hyyro, and the term "edit
distance".
  If you want a historical perspective, the name "Esko Ukkonen" is
canonical.

  Hope this helps,
Brian


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net