Boost Users :

From: Nat Goodspeed (ngoodspeed_at_[hidden])
Date: 2006-08-29 10:42:13


> -----Original Message-----
> From: boost-users-bounces_at_[hidden] [mailto:boost-users-
> bounces_at_[hidden]] On Behalf Of david v
> Sent: Tuesday, August 29, 2006 10:02 AM
> To: boost-users_at_[hidden]
> Subject: [Boost-users] Mismatch and regex newbie problem still problem
>
> It may sound weird to you but the way I'm using the regex is to identify
> genomic regions, so in other words for biological applications.
> In some cases my regex is a piece of DNA such as "atgcta" and I want to
> search for this regex in another piece of DNA.

[Nat] That's not weird. I wasn't questioning your desired processing,
just trying to figure out where the disconnect lies.

> In some cases I want to be able to search for "atgcta" but I
> want to allow some mismatches. Obviously I will get even more matches, but
> I think regex can be a much more efficient way than by building ip
> alignment matrices.

[Nat] This is where you have to get really specific about what kinds of
mismatches you want to recognize. For example, will the sequence always
begin with "at" and end with "ta" and be separated by exactly two items?
("atxxta") In that case regex is perfect for the problem. If the
variable items are at fixed positions within the original pattern, it's
easy.
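
To make the fixed-position case concrete, here is a minimal sketch. It
assumes the "atxxta" shape described above and uses std::regex as a
stand-in for Boost.Regex; the function name find_at_xx_ta and the
restriction to the lowercase alphabet "acgt" are my own assumptions, not
something from your message.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: start offsets of every non-overlapping match of
// "at", then any two bases, then "ta" in `dna`.
std::vector<std::size_t> find_at_xx_ta(const std::string& dna)
{
    // Assumption: the sequence uses the lowercase alphabet a, c, g, t.
    static const std::regex re("at[acgt]{2}ta");

    std::vector<std::size_t> hits;
    for (std::sregex_iterator it(dna.begin(), dna.end(), re), end;
         it != end; ++it)
    {
        hits.push_back(static_cast<std::size_t>(it->position()));
    }
    return hits;
}
```

Note that the iterator reports non-overlapping matches only; if
overlapping hits matter for your data, you would need to restart the
search one character after each match.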

But since I don't yet know the full set of cases you intend to allow as
a "mismatch," there's room for me to speculate that you might want to
find "xtgcta" or "axgcta" or "atxcta" or "atgxta" or "atgcxa" or
"atgctx" (one item wrong) or "xxgcta" or "xtxcta" or ...

That's for a sequence length of 6. The full list of combinations is so
long that you'd want to generate it programmatically. Step up to a
length of 7 and that list gets alarmingly longer. That's what I meant
when I said that it could quickly explode.
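
Generating that list programmatically might look something like the
following sketch. It assumes a "mismatch" means "any single character at
that position" (written as the regex wildcard '.') and that you ask for
exactly k such positions; wildcard_variants is a hypothetical name I made
up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: every variant of `pattern` in which exactly `k`
// positions are replaced by the regex wildcard '.'.
std::vector<std::string> wildcard_variants(const std::string& pattern,
                                           std::size_t k)
{
    std::vector<std::string> out;
    if (k > pattern.size())
        return out;

    // mask[i] == 1 marks a wildcard position. Starting from the sorted
    // arrangement (all 1s at the end), std::next_permutation visits each
    // distinct placement of the k wildcards exactly once.
    std::vector<int> mask(pattern.size(), 0);
    std::fill(mask.end() - static_cast<std::ptrdiff_t>(k), mask.end(), 1);

    do {
        std::string variant = pattern;
        for (std::size_t i = 0; i < mask.size(); ++i)
            if (mask[i])
                variant[i] = '.';
        out.push_back(variant);
    } while (std::next_permutation(mask.begin(), mask.end()));

    return out;
}
```

For a pattern of length n this produces "n choose k" variants, which you
could then join with '|' into one alternation. Summed over every k, the
count doubles with each extra character in the pattern.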

How many items must match the original pattern before you recognize a
valid "mismatch"? Presumably it's at least 1, otherwise you'd validly
"mismatch" every position in every string. In practice, matching only 1
item out of 6 makes little sense to me either. You need to define what
does make sense.

It may be that regex is still the best tool for the job. But depending
on the full set of possibilities that you mean by "mismatch," you might
have to hand-code the (mis)match testing instead. I can imagine
generating a regex string that would choke the library.
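
By "hand-code" I mean something along these lines: slide the pattern
along the sequence and count the positions that differ. This is only a
sketch under the assumption that a mismatch is a substitution (no
inserted or deleted items); the function name find_with_mismatches and
the max_mismatches parameter are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: start offsets in `text` where `pattern` matches
// with at most `max_mismatches` differing characters (substitutions only).
std::vector<std::size_t> find_with_mismatches(const std::string& text,
                                              const std::string& pattern,
                                              std::size_t max_mismatches)
{
    std::vector<std::size_t> hits;
    if (pattern.empty() || pattern.size() > text.size())
        return hits;

    for (std::size_t start = 0; start + pattern.size() <= text.size(); ++start)
    {
        std::size_t mismatches = 0;
        // Stop comparing this window as soon as the budget is exceeded.
        for (std::size_t i = 0;
             i < pattern.size() && mismatches <= max_mismatches; ++i)
        {
            if (text[start + i] != pattern[i])
                ++mismatches;
        }
        if (mismatches <= max_mismatches)
            hits.push_back(start);
    }
    return hits;
}
```

The cost is at most (text length) x (pattern length) comparisons, and it
does not grow with the number of mismatch combinations the way a
generated regex would.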

