Boost logo

Boost :

From: John Maddock (John_Maddock_at_[hidden])
Date: 2000-08-16 05:52:11


Beman,

>But what if efficiency rears its ugly head? Industrial code I have seen
in
>effect re-ordered the regular expressions into groups according to common
>initial sub-expressions, determined which sub-expression applied, and then

>only made the pattern by pattern match attempts on patterns that stood a
>chance of matching. The logical extreme of this approach would be a tree
>search that did the minimum number of regex_match() calls.
>
>I don't hold it against regex if that isn't supported directly, but I
would
>like to hear John Maddock (or anyone else) speculate on the relative ease
>of implementing this challenge fairly efficiently on top of regex.

Back in the early days of regex++ someone mailed me about that exact
problem (well actually they were searching for dates across a hard drive
for a forensic application), in that case the "brute force" approach
(search for each expression in turn) worked well for them, but probably
would have been quicker had they spent some time developing a single regex
to do the job.

One option is to alternate each of the expressions and use parenthesis to
indicate priority:

(expression_1)|(expression_2)|(expression_3)

In this case the leftmost - longest rules (applied to the sub-expressions)
will produce the first of the list that matches, *provided* one of the
later matches can't produce a longer match (and I admit that that may be a
showstopper for searches as opposed to data validation). The advantage of
this approach is that it lets the regex compiler analyze all the
expressions "as one", how much that gains you will depend upon the regex
engine used - regexes dedicated to pre-compiled expressions (like flex)
will typically do much more analysis than "mixed usage" regexes like
regex++ (with regex++ you gain most if each of the alternatives is mutually
exclusive - for example one starts with 'a' and the other starts with
[^a]).

- John.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk