Boost Users :

From: Dave DeLong (davedelong_at_[hidden])
Date: 2008-03-12 21:54:45

Next message: Matt Brown: "[Boost-users] Visual C++ does not recognize lambda metafunction as such"
Previous message: Jeremy Conlin: "Re: [Boost-users] program_options: Printing options and their values"
Next in thread: John Maddock: "Re: [Boost-users] Another Regex Question"
Reply: John Maddock: "Re: [Boost-users] Another Regex Question"

This is both a regex syntax question and a Boost question, so here goes...
I've got the following code to strip out all <a>, <frame>, and <iframe> tags
from a webpage and parse them for their href or src attributes (yes, I
realize that it can potentially grab an <a src=""> or an <iframe href="">,
but that's ok for this project).

Surprise surprise, it doesn't work quite as I'd hoped, and I was wondering
if you could help me ascertain the problem:

(pageSource is a pointer to a string containing the source of the page; the
project specifications allow for the attribute to be formatted with either a
single or double quote or neither around the actual URL. It correctly finds
each tag and attribute, but it's grabbing the URL and also the "> that
follow it.) How can I get rid of the closing "> ?

void Page::parseLinks() {
    // Match whole <a>, <frame>, and <iframe> tags, case-insensitively.
    boost::regex linkTagRegex("(?i)<(a|i?frame)[^>]*>");
    // Intended to capture the href/src value in group 2.
    boost::regex linkRegex("(?i)(href|src)\\s*?=[\\w]*?([\\W]*?)[\\w]+?");

    boost::sregex_token_iterator p(pageSource->begin(), pageSource->end(),
                                   linkTagRegex, 0);
    boost::sregex_token_iterator end;

    for (; p != end; ++p) {
        string tag(p->first, p->second);
        boost::cmatch matches;
        if (boost::regex_search(tag.c_str(), matches, linkRegex)) {
            string * newLink = new string(matches[2].first);
            URL * foundLink = new URL(newLink);
            delete newLink;
            foundLink->resolveWithRespectTo(pageURL);
            foundLinks->add(foundLink);
        }
    }
}
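
One likely cause of the trailing "> can be sketched as follows. With boost::cmatch, matches[2].first is a plain const char* into the tag, so new string(matches[2].first) copies everything up to the terminating NUL rather than stopping at the end of the capture. Taking the capture's own range (matches[2].str(), or the first/second pair) avoids that. The sketch below is not a drop-in fix: it uses std::regex instead of Boost.Regex so that it is self-contained (both expose str(), first, second, and matched on a sub-match), the helper name extractLink is hypothetical, and the pattern is only one possible way to accept a double-quoted, single-quoted, or unquoted value.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (not part of the original code): returns the value of
// the first href/src attribute found in a single tag, or "" if there is none.
std::string extractLink(const std::string& tag) {
    // One capture group per quoting style:
    //   group 1: double-quoted, group 2: single-quoted, group 3: unquoted.
    // std::regex has no inline (?i), so case-insensitivity is passed as a flag.
    static const std::regex linkRegex(
        R"re((?:href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))re",
        std::regex::icase);

    std::smatch m;
    if (!std::regex_search(tag, m, linkRegex)) {
        return "";
    }
    // Exactly one of the three groups participates in a successful match.
    for (std::size_t i = 1; i < m.size(); ++i) {
        if (m[i].matched) {
            // str() copies only [first, second), so nothing after the
            // capture (closing quote, '>') comes along.
            return m[i].str();
        }
    }
    return "";
}
```

With this shape, the caller would build the URL from the returned string instead of from a raw pointer into the tag; an empty result covers both "no attribute" and an empty quoted value.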

Thanks!

Dave

Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net