Boost Users :

From: John Maddock (john_at_[hidden])
Date: 2008-03-13 06:53:48


Dave DeLong wrote:
>> This is both a regex syntax question and a Boost question, so here goes...
>> I've got the following code to strip out all <a>, <frame>, and
>> <iframe> tags from a webpage and parse them for their href or src
>> attributes (yes, I realize that it can potentially grab an <a
>> src=""> or an <iframe href="">, but that's ok for this project).
>>
>> Surprise surprise, it doesn't work quite as I'd hoped, and I was
>> wondering if you could help me ascertain the problem:
>>
>> (pageSource is a pointer to a string containing the source of the
>> page; the project specifications allow for the attribute to be
>> formatted with either a single or double quote or neither around the
>> actual URL. It correctly finds each tag and attribute, but it's
>> grabbing the URL and also the "> that follow it.) How can I get rid
>> of the closing "> ?

Since hrefs must be quoted to be valid HTML, how about searching for:

"href=\"[^\"]*\""
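
As written, that pattern still matches the surrounding quotes, so to get at the bare URL you would want a capture group around the [^\"]* part. Here is a minimal sketch of that idea. It uses std::regex so that it compiles without Boost; boost::regex, boost::sregex_iterator and boost::smatch should drop in with the same structure. The helper name extract_hrefs is made up for illustration and is not from the original code, and the sketch only handles double-quoted href values.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): returns the value of
// every double-quoted href attribute found in `page`, without the quotes.
std::vector<std::string> extract_hrefs(const std::string& page) {
    // Same pattern as suggested above, plus a capture group around the URL.
    // The closing quote is matched but sits outside the group, so it is
    // never part of the returned text.
    static const std::regex href_re("href=\"([^\"]*)\"");

    std::vector<std::string> urls;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(page.begin(), page.end(), href_re); it != end; ++it) {
        urls.push_back((*it)[1].str());  // sub-match 1 is the URL only
    }
    return urls;
}
```

Calling extract_hrefs(*pageSource) on the page string would then give back just the URLs, with no trailing "> attached, because [^\"]* cannot run past the closing quote.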

But if this is an assignment, I'm not sure we should be answering ;-)

John.

