Boost Users :

From: kiran (kiran.happy_at_[hidden])
Date: 2006-08-29 12:46:05


Hi,
      Thanks for answering the question and spending your valuable time to
help novices like me. I have never spoken to a person of such high calibre in
my whole life. To be frank, I never expected a reply from you.

      Coming to the topic, I never knew that in boost::regex the "dot" matches
a newline by default. Is there any way to avoid this and make it behave like a
Perl regular expression? Will the "boost::regex::perl" flag do it?

      Another thing: why is --- regex expr("Resurfacing.*?Home",
boost::regex::icase) --- getting aborted?

Thanks
Kiran.

----- Original Message -----
From: "John Maddock" <john_at_[hidden]>
To: <boost-users_at_[hidden]>
Sent: Tuesday, August 29, 2006 9:28 PM
Subject: Re: [Boost-users] [Regex] - Is this a limitation of
sregex_iterator?

> kiran wrote:
>> Hi
>> I am using sregex_iterator to parse an html file like below.
>>
>> string htmlFile;
>> //populate htmlFile with a file contents
>>
>> regex regExpr("Resurfacing(.|\\n)*Home", boost::regex::icase);
>> sregex_iterator itr(htmlFile.begin(), htmlFile.end(), regExpr);
>>
>> But that is throwing a std::runtime_error exception with the message
>> "Regular expression too big". What can I do to avoid this?
>> I went through http://www.boost.org/libs/regex/doc/configuration.html
>> and changed the variables as below
>>
>> #define BOOST_REGEX_NON_RECURSIVE
>> #define BOOST_REGEX_BLOCKSIZE (4096 * 10)
>> #define BOOST_REGEX_MAX_BLOCKS 1024
>>
>> Even after that I am getting the same error message. I found that
>> when I changed the regular expression (the 'regExpr' variable) to a
>> simpler one, the exception (std::runtime_error) was not thrown. I
>> need to parse big HTML files with some complex regular expressions.
>> I don't mind even if the sregex_iterator takes a lot of memory or
>> time. How can I solve this error? Or is this a limitation of the
>> boost::regex library?
>
> The error message is somewhat misleading, for which I apologise. It's not
> really a limitation of the library: it's a limitation of Perl style
> regexes. The problem is that many Perl style regexes are sufficiently
> ambiguous that they can take effectively "forever" to match. Boost.Regex
> tries to shield you from this possibility by keeping track of how many
> states the state-machine has visited, and throwing an exception if the
> number of states visited looks to be growing too fast compared to the
> text searched.
>
> I'm a little surprised that this expression should be giving you problems
> but basically:
>
> The .|\\n part is superfluous as . matches newlines by default in
> Boost.Regex anyway. There are also some optimisations that get applied
> to .* that don't apply otherwise :-)
>
> If there are large numbers of newlines in the text being searched then
> (.|\\n)* creates a large number of possible branches through the state
> machine: basically the number of possible paths doubles for each newline,
> which is what leads eventually to the exception being thrown.
>
> You might also want to question whether you want a greedy repeat here, and
> whether "Resurfacing.*?Home" wouldn't be more to the point.
>
> HTH, John.
>
> _______________________________________________
> Boost-users mailing list
> Boost-users_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/boost-users

