Boost Users :

From: John Maddock (john_at_[hidden])
Date: 2006-08-29 11:58:30


kiran wrote:
> Hi
> I am using sregex_iterator to parse an html file like below.
>
> string htmlFile;
> //populate htmlFile with a file contents
>
> regex regExpr("Resurfacing(.|\\n)*Home", boost::regex::icase);
> sregex_iterator itr(htmlFile.begin(), htmlFile.end(), regExpr);
>
> But that is throwing a std::runtime_error exception with the message
> "Regular expression too big". What can I do to avoid this?
> I went through http://www.boost.org/libs/regex/doc/configuration.html
> and changed the variables like below
>
> #define BOOST_REGEX_NON_RECURSIVE
> #define BOOST_REGEX_BLOCKSIZE (4096 * 10)
> #define BOOST_REGEX_MAX_BLOCKS 1024
>
> Even after that I am getting the same error message. I found that
> when I changed 'regExpr' to a simpler regular expression, the
> exception (std::runtime_error) was not thrown. I need to parse big
> html files with some complex regular expressions. I don't mind even
> if the sregex_iterator takes a lot of memory or time. How can I solve
> this error? Or is this a limitation of the boost::regex library?

The error message is somewhat misleading, for which I apologise. It's not
really a limitation of the library: it's a limitation of Perl style regexes.
The problem is that many Perl style regexes are sufficiently ambiguous that
they can take effectively "forever" to match. Boost.Regex tries to shield
you from this possibility by keeping track of how many states the
state-machine has visited, and throwing an exception if the number of states
visited looks to be growing too fast compared to the amount of text searched.
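
Since this guard surfaces as an exception, calling code can catch it and decide what to do next (for example, retry with a simpler expression). The following is only a minimal sketch: it uses std::regex rather than Boost.Regex so that it builds without Boost, and first_match is a hypothetical helper name, not part of either library. Whether a given std::regex implementation ever throws for complexity reasons is implementation-dependent; the same shape should carry over to boost::regex and boost::regex_search, since the exception reported above is a std::runtime_error.

```cpp
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a library function): stores the first match of
// `pattern` in `out` and returns true. Returns false if nothing matched or
// if the regex engine threw while compiling or searching.
bool first_match(const std::string& text, const std::string& pattern,
                 std::string& out)
{
    try {
        const std::regex re(pattern, std::regex::icase);
        std::smatch m;
        if (std::regex_search(text, m, re)) {
            out = m.str();
            return true;
        }
    } catch (const std::runtime_error&) {
        // std::regex_error derives from std::runtime_error, which is also
        // the exception type reported for Boost.Regex in this thread.
        // A real program would probably log e.what() before giving up.
    }
    return false;
}
```

A caller would pass the file contents, the expression, and a std::string to receive the match, and treat a false return as "no match or the engine gave up".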

I'm a little surprised that this expression should be giving you problems
but basically:

The .|\\n part is superfluous, as . matches newlines by default in
Boost.Regex anyway. There are also some optimisations that get applied to .*
that don't apply otherwise :-)

If there are large numbers of newlines in the text being searched then
(.|\\n)* creates a large number of possible branches through the state
machine: basically the number of possible paths doubles for each newline,
which is what leads eventually to the exception being thrown.

You might also want to question whether you want a greedy repeat here, and
whether "Resurfacing.*?Home" wouldn't be more to the point.
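
To make the greedy versus non-greedy difference concrete, here is a small sketch. It is an illustration under assumptions rather than the code from the original question: it uses std::regex so it builds without Boost, and because . does not match newlines in std::regex's default grammar, the character class [\s\S] stands in for it. With Boost.Regex the expression suggested above, "Resurfacing.*?Home", would be used directly. all_matches is a hypothetical helper name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not a library function): collect every
// case-insensitive match of `pattern` in `text`, in order.
std::vector<std::string> all_matches(const std::string& text,
                                     const std::string& pattern)
{
    const std::regex re(pattern, std::regex::icase);
    std::vector<std::string> out;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), re); it != end; ++it)
        out.push_back(it->str());
    return out;
}
```

On a text containing two "Resurfacing ... Home" sections, the greedy form "Resurfacing[\\s\\S]*Home" yields a single match running from the first "Resurfacing" to the last "Home", while the non-greedy form "Resurfacing[\\s\\S]*?Home" yields one match per section.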

HTH, John.
