From: John Maddock (john_at_[hidden])
Date: 2007-03-27 05:50:27


Tommy McClung wrote:
> Here's my situation.
>
> We've got some software that makes use of the boost regex libraries
> and we've compiled and linked with the ICU libraries enabled. We
> need this for utf-8 support.
>
> Our platforms are Windows (XP and Vista) and OS X.
>
> I have a regular expression that is parsing an html page that is
> utf-8 encoded and it's a rather complex expression, but I've made
> sure to add anchors such that I don't run into catastrophic
> backtracking. Another note, I'm using u32regex_search and my flags
> are match_default | match_partial. I have to use match_partial
> because the input data is long and I've seen memory exhausted even
> when I've increased BOOST_REGEX_MAX_BLOCKS (I probably have more work
> to do on the expression to reduce the complexity).

There may be a misunderstanding here - or rather I need to improve the error
messages :-) If regex_search throws, it is usually because the time taken to
match the expression is growing too fast: the matcher gives up rather than
risk going on indefinitely. You *can* get exceptions from very large
expressions if BOOST_REGEX_MAX_BLOCKS is set too low, but it has to be a
truly humongous expression for that to happen :-)

So the normal fix is to try and make the expression as "precise" as possible
and avoid nested repeats that end up looking like (.+)+

> On Windows I have no problems. Partial matches are returned and I
> eventually get full matches and my code runs great.
>
> On OS X, u32regex_search returns and indicates it has found a partial
> match (what[0].matched == false). But what[0].second is set to the end
> of my input string, so no further matching takes place and my loop
> ends. This is very different behavior than what I'm seeing on
> Windows.

If you get a partial match then what[0].second should always point at the
end of the input: that's what a partial match means - that if there were
more input, we might have had a full match.

> I know this email is vague, but I can provide any details that would
> be helpful in solving this issue. Why the difference between Windows
> and OS X? I've compiled both boost libraries with the same compiler
> options on both platforms. Is it an ICU compile issue?

Could be either: if you are using VC 6, 7 or 7.1 on Win32 (but not VC 8)
then the regex engine uses a different, recursive (i.e. quicker) algorithm
than it does on platforms/compilers that don't support recovery from stack
overflows. You can change the behaviour on either platform by defining
either BOOST_REGEX_RECURSIVE or BOOST_REGEX_NON_RECURSIVE in
boost/regex/user.hpp.
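
For example, to get the same algorithm on both platforms, the relevant part
of user.hpp might look like the sketch below. This is only a fragment, and
it assumes a Boost release that still honours both macros. Since this changes
how the library itself is compiled, it is safest to rebuild the regex library
afterwards.

```cpp
// boost/regex/user.hpp (excerpt)
// Define at most one of the two macros below.

// Recursive matcher: quicker, but relies on recovery from stack overflow.
// #define BOOST_REGEX_RECURSIVE

// Non-recursive matcher: the same code path on Windows and OS X.
#define BOOST_REGEX_NON_RECURSIVE
```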

Probably the easiest thing is for you to let me have a test case I can play
with.

HTH, John.

