Boost Users :

From: Dave DeLong (davedelong_at_[hidden])
Date: 2008-03-12 08:47:18


Ah, you're right. That was one of my attempts to fix it (which, as you
can guess, didn't work).

As for the inefficiency, this is my first stab at regex. =)

Here's the complete function as it stands (or doesn't, since it still
crashes):

void Page::removeScriptTags() {
        boost::regex tagRegex("(?i)<script[^>]*>.*?</script[^>]*>");
        string source(*pageSource);
        string replaced = boost::regex_replace(source, tagRegex, " ",
                                               boost::match_default);
        delete pageSource;
        pageSource = new string(replaced);
}

As I mentioned earlier, pageSource is a heap-allocated string that stores
the contents of the web page. I thought the problem might be that
pageSource is on the heap, so I've been trying to move it to the stack
to see whether that makes a difference. It doesn't seem to. I
still crash at this line:

        string replaced = boost::regex_replace(source, tagRegex, " ",
                                               boost::match_default);
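
For comparison, here is a minimal sketch of the same replacement with the
page text handled by value, so there is no new/delete to get wrong. It is
not a diagnosis of the crash. It assumes std::regex as a stand-in for
Boost.Regex so that it builds without Boost, and the free-function
signature is hypothetical. Because std::regex has no inline case modifier
and its "." does not match newlines, the sketch uses the icase flag and
[\s\S]*? instead.

```cpp
#include <regex>
#include <string>

// Hypothetical free-function variant of removeScriptTags(): the page is
// taken and returned by value, so no manual new/delete is involved.
std::string removeScriptTags(const std::string& source) {
    // Assumption: std::regex (ECMAScript grammar) stands in for Boost.Regex.
    // Case insensitivity comes from the icase flag, and [\s\S]*? is used for
    // the script body because "." does not match newlines in this grammar.
    static const std::regex tagRegex(
        R"(<script[^>]*>[\s\S]*?</script[^>]*>)",
        std::regex::ECMAScript | std::regex::icase);

    // Each matched <script>...</script> block is replaced by a single space.
    return std::regex_replace(source, tagRegex, " ");
}
```

A caller would then write something like `page = removeScriptTags(page);`
where `page` is a plain std::string.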

Thanks,

Dave

On 12 Mar, 2008, at 6:39 AM, John Maddock wrote:
>
> That shouldn't even compile - there are too many arguments to
> regex_replace - it should just be,
>
> *pageSource = boost::regex_replace(*pageSource, tagRegex, " ",
> boost::match_default);
>
> The expression is also needlessly inefficient: you could just make the
> whole expression case insensitive by prefixing it with "(?i)". Also,
> "[\\w\\W]" will match *either* something that is a word character, *or*
> something that is *not* a word character, which is probably not what you
> meant :-) Likewise, "[.]" will match a literal ".", which again is
> probably not what you meant. So maybe try something like:
>
> "(?i)<script[^>]*>.*?</script[^>]*>"
>
>>> and this code crashes when attempting to destruct "matches":
>>>
>>> void Page::findTitleSummary() {
>>>     boost::cmatch matches;
>>>     boost::regex bodyRegex(
>>>         "<[tT][iI][tT][lL][eE][\\w\\W]*?>([^<]*)</\\s*?[tT][iI][tT][lL][eE]\\s*?>");
>>>     if (boost::regex_search(pageSource->c_str(), matches, bodyRegex)) {
>>>         pageSummary = new string(matches[1]);
>>>         hasFoundSummary = true;
>>>     }
>>> }
>>>
>>> What am I missing?
>
> Without seeing a compilable code sample to play with I don't know, but it
> looks like you're accessing memory that's already gone out of scope
> somewhere.
>
> HTH, John.
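
The concern about memory going out of scope can be illustrated with a
small sketch: match results only refer to positions in the searched text,
so the capture is copied into its own string while that text is still
alive. This again assumes std::regex in place of Boost.Regex, and
findTitle with its out-parameter is a hypothetical helper, not the Page
method quoted above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: copies the text of the first <title> element into
// `title` and returns true, or returns false if no title is found.
bool findTitle(const std::string& page, std::string& title) {
    // Assumption: std::regex stands in for Boost.Regex; icase replaces the
    // hand-written [tT][iI]... character classes.
    static const std::regex titleRegex(
        R"(<title[^>]*>([^<]*)</\s*title\s*>)",
        std::regex::ECMAScript | std::regex::icase);

    // `matches` stores iterators into `page`, so it must not outlive `page`.
    std::smatch matches;
    if (std::regex_search(page, matches, titleRegex)) {
        // Copy the capture into an independent string before returning.
        title = matches[1].str();
        return true;
    }
    return false;
}
```

Searching a named std::string rather than a temporary buffer keeps the
searched text alive for as long as the match results are in use.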


