Boost logo

Boost Users :

From: John Maddock (john_at_[hidden])
Date: 2006-10-29 11:43:25


Dave Druelinger wrote:
>> I'm using Boost.Regex to match a long, complicated pattern to
>> between 30 and 50 long strings. I loop through and call
>> regex_search until it doesn't find any more matches in the current
>> string and then it moves on to the next string. The problem is that
>> when it has looped through about 40 of the strings, Boost.Regex will
>> hang on the regex_search() function, and it will just sit there
>> processing forever. If I reduce the number of strings I match
>> against there is no problem; 20-30 iterations works like a charm.
>> It also seems to help if I reduce the size of the regex pattern.
>>
>> Frankly I have no clue what could be causing the problem, and any
>> help would be greatly appreciated.

I think you need to isolate the problem: figure out which specific string
against which specific pattern is causing the problem. [Aside: should you
happen to be using Visual Studio, wait till it hangs, and then do a
"break-all" and work up the call stack to see what the string and pattern
was].

Which Boost.Regex version are you using? In all recent releases there is
code to detect pathological cases and throw an exception so you don't get
into this situation.

BTW: the issue here is that matching Perl style regexes is an NP-Hard
problem in general. Usually you will never hit that kind of situation
though :-) When things go wrong, it's normally nested "ambiguous" repeats
combined with a particular pattern of text in the string - long sections of
whitespace for example. The trick is to try and optimise your regex so that
each time the state machine has to make a decision, there is only one way it
can go for any given input characater.

For example if you have:

(\s*something-to-be-matched\s*)+

the expression may behave badly when it sees long sections of whitespace,
since those two \s*'s are effectively right next to each other, but if we
make it less ambiguous:

\s*(something-to-be-matched\s*)+

then it will be fine, because you don't have two \s*\s* 's next to each
other.

Hopefully, I'm making sence, and this will trigger an idea :-)

Regards, John Maddock.


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net