
Boost Users :

From: John Maddock (john_at_[hidden])
Date: 2008-03-12 08:39:11


Dave DeLong wrote:
>> Hi everyone,
>>
>> I'm trying to parse an HTML page using the Regex library and am
>> running into errors.
>>
>> In the following snippets, "pageSource" is a string pointer to the
>> contents of an html file.
>>
>> This code causes my app to crash:
>>
>> void Page::removeScriptTags() {
>>     boost::regex tagRegex("<[sS][cC][rR][iI][pP][tT][\\w\\W]*?>[.]*?</\\s*?[sS][cC][rR][iI][pP][tT]\\s*?>");
>>     string replaced = boost::regex_replace(*pageSource, pageSource, tagRegex, " ", boost::match_default);
>>     delete pageSource;
>>     pageSource = new string(replaced);
>> }

That shouldn't even compile: there are too many arguments to
regex_replace. It should just be the following, which also makes the
delete/new pair unnecessary:

*pageSource = boost::regex_replace(*pageSource, tagRegex, " ",
boost::match_default);

The expression is also needlessly inefficient. You can make the whole
expression case insensitive by prefixing it with "(?i)" instead of spelling
out every letter as a character class. Also, "[\\w\\W]" matches *either* a
word character *or* a non-word character, in other words any character at
all, which is probably not what you meant :-) Likewise, "[.]" matches only a
literal ".", which again is probably not what you meant. So maybe try
something like:

"(?i)<script[^>]*>.*?</script[^>]*>"
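
As a rough illustration of the simplified expression in use, here is a minimal sketch. It uses std::regex as a stand-in for Boost.Regex so that it compiles without Boost; that swap, the free function removeScriptTags, and its by-value signature are all assumptions rather than anything from the original code. std::regex has no inline case-insensitivity prefix, so the sketch uses the icase flag, and it uses "[\\s\\S]*?" for the script body because "." in ECMAScript syntax does not match a newline.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): returns a copy of
// `html` with every <script>...</script> element replaced by one space.
std::string removeScriptTags(const std::string& html) {
    // Case-insensitivity comes from the icase flag rather than an inline
    // prefix. "[^>]*" consumes any attributes inside the opening tag, and
    // the lazy "[\\s\\S]*?" stops at the first closing </script> tag.
    static const std::regex tagRegex(
        "<script[^>]*>[\\s\\S]*?</script[^>]*>",
        std::regex::ECMAScript | std::regex::icase);
    return std::regex_replace(html, tagRegex, " ");
}
```

Taking and returning std::string by value means the caller can simply assign the result back to its own string, with no delete/new involved.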

>> and this code crashes when attempting to destruct "matches":
>>
>> void Page::findTitleSummary() {
>>     boost::cmatch matches;
>>     boost::regex bodyRegex("<[tT][iI][tT][lL][eE][\\w\\W]*?>([^<]*)</\\s*?[tT][iI][tT][lL][eE]\\s*?>");
>>     if (boost::regex_search(pageSource->c_str(), matches, bodyRegex)) {
>>         pageSummary = new string(matches[1]);
>>         hasFoundSummary = true;
>>     }
>> }
>>
>> What am I missing?

Without seeing a compilable code sample to play with, I don't know, but it
looks like you're accessing memory that has already been freed or has gone
out of scope somewhere.
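
One way to rule out lifetime problems of that kind is to avoid raw string pointers altogether. The following is only a sketch under assumptions: it again uses std::regex as a stand-in for Boost.Regex so that it compiles without Boost, and the free function findTitle and its parameters are hypothetical, not taken from the original class. It is not a diagnosis of the actual crash.

```cpp
#include <regex>
#include <string>

// Hypothetical value-based variant of findTitleSummary: on a match it
// stores the title text in `summary` and returns true.
bool findTitle(const std::string& pageSource, std::string& summary) {
    static const std::regex titleRegex(
        "<title[^>]*>([^<]*)</title[^>]*>",
        std::regex::ECMAScript | std::regex::icase);
    // The match results hold iterators into pageSource, so they are only
    // valid while pageSource is alive and unmodified.
    std::smatch matches;
    if (std::regex_search(pageSource, matches, titleRegex)) {
        summary = matches[1].str();  // copy the capture out before returning
        return true;
    }
    return false;
}
```

Because the captured text is copied into a std::string owned by the caller, nothing refers back to the searched buffer after the function returns.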

HTH, John.

