Boost logo

Boost Users :

From: Mike Marchywka (marchywka_at_[hidden])
Date: 2008-02-12 07:56:53


>That would work for simple cases where the two sub-patterns are separated by ".*", but fail for other non-trivial expressions such>as "omg(bbq|cakes)*wtf", which the program must also be able to accept. ;_;

The technique is obviously extendable. I'm doing a similar thing where I have boost regex and greta available as program
options but also my own code specialized for certain odd searches. I generally do 1000's of REGEX's against 10-100 strings
from 10-500k long each. To make the speed reasonable, I can compile the regex into a simple intermediate form
and record which features I will need during searching- if you find a '(" or '|' then you just need to take a slightly
different strategy then "omg.wtf". The overhead in doing a parse on a small regex is obviously lower than making
a pointless check everywhere along the samples.

It also may be worth considering indexing the string ( make something similar to database index of table of contents)
if you have a lot of REGEX's with single character matches or can identify the most unique part of the regex and look for that first.
If every REGEX has a 4 char exact match
requirement, ...ABCD..., there is little point checking for the complicated stuff when it isn't near the "ABCD" part.

Mike Marchywka
586 Saint James Walk
Marietta GA 30067-7165
404-788-1216 (C)<- leave message
989-348-4796 (P)<- emergency only
marchywka_at_[hidden]
Note: Hotmail is blocking my mom's entire
ISP claiming it is to reduce spam but probably
to force users to use hotmail. Please DON'T
assume I am ignoring you and try
me on marchywka_at_[hidden] if no reply
here. Thanks.

________________________________

Date: Mon, 11 Feb 2008 18:08:31 -0800
From: gene_at_[hidden]
To: boost-users_at_[hidden]
Subject: Re: [Boost-users] [Regex] "Truly" incremental search?

Renato golin wrote:

Eugene M. Kim wrote:

I am looking for a way to feed the input bytes only once, in order to
reduce the time complexity to approximately O(n). Of course I will lose
the ability to recover submatches, but the ability is unnecessary for
this particular application that I am considering. I cannot break the
regex into two parts ("omg" and "wtf") either because it is specified as
an input to the program.

Hi,

It may sound silly but what if you pre-process the input splitting it by
".*" and do an AND match for all sub-strings?

input = "omg.*wtf" => input[] = "omg", "wtf";
return (regexp_match(input[0]) && regexp_match(input[1]));

cheers,
--renato

PS: of course the code above is pseudo-code and of course you'll take as
many input parameters as available and not hard-coded like this.

That would work for simple cases where the two sub-patterns are separated by ".*", but fail for other non-trivial expressions such as "omg(bbq|cakes)*wtf", which the program must also be able to accept. ;_;

Eugene

_________________________________________________________________
Need to know the score, the latest news, or you need your Hotmail®-get your "fix".
http://www.msnmobilefix.com/Default.aspx


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net