Boost Users :

From: John Maddock (john_at_[hidden])
Date: 2007-04-09 05:15:23


Jeff wrote:
> Richard Dingwall <rdingwall <at> gmail.com> writes:
>
>> Have you tried a simpler pattern:
>> <a .*href.*/a>
>
> Richard, thank you for responding. I tried your pattern above and it
> continues beyond '</a>'.
>
> The pattern that I am now using below accurately matches everything
> from here '<a href=' to here '</a>'.
>
> ////////////////////////////////
> char exp[] = "<a href(.*?)/a>";
> boost::regex e(exp, boost::regex::normal | boost::regbase::icase);
> boost::sregex_token_iterator i(sFileCont.begin(), sFileCont.end(), e, 0);
> boost::sregex_token_iterator j;
> while(i != j)
>     cout << *i++ << "\n";
> ////////////////////////////////
>
> Please note that sregex_token_iterator's 4th parameter is set to
> submatch = 0 in my code above. This leaves me with 2 questions:
>
> 1. although I have specified submatch = 0, I am creating a marked
> sub-expression, (.*?), and I don't understand why the sub-expression
> is required or if there is a better way that I don't know about.

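It's not required unless you want it: submatch 0 is always the whole match,
so "<a href.*?/a>" without the parentheses gives you the same strings.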
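
To see this in code, here is a minimal sketch. It uses std::regex from C++11 as a stand-in for boost::regex (an assumption, made only so it builds without Boost; the token-iterator interface is very similar). The helper name whole_matches is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect submatch 0 (the whole match) for every hit.
// std::regex is used here in place of boost::regex (an assumption).
std::vector<std::string> whole_matches(const std::string& text,
                                       const std::string& pattern) {
    assert(!pattern.empty());  // an empty pattern would match at every position
    std::regex e(pattern, std::regex::ECMAScript | std::regex::icase);
    std::vector<std::string> out;
    // The 4th argument selects which submatch to enumerate; 0 is the whole match.
    std::sregex_token_iterator i(text.begin(), text.end(), e, 0), j;
    for (; i != j; ++i)
        out.push_back(i->str());
    return out;
}
```

Calling whole_matches with "<a href(.*?)/a>" and with "<a href.*?/a>" on the same input should return identical lists, since the parentheses only add a submatch 1 that is never asked for.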

> 2. why is the ? required in the sub-expression above?

Without the ? the .* is greedy: it will match as many characters as it can
while still leaving a closing </a> to follow, hence you get everything from
the first <a> to the last </a>, which is clearly not what you wanted :-)
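
A small sketch of the difference, again using std::regex in place of boost::regex (an assumption, so that it builds without Boost). The helper name first_match and the sample text in the note are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: return the first match of `pattern` in `text`,
// or an empty string if there is none.
std::string first_match(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    std::regex e(pattern, std::regex::ECMAScript | std::regex::icase);
    std::smatch m;
    return std::regex_search(text, m, e) ? m.str(0) : std::string();
}
```

Given text containing two links on one line, such as <a href="1">one</a> <a href="2">two</a>, the greedy "<a href.*/a>" should return the whole line, while the non-greedy "<a href.*?/a>" should stop at the first </a>.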

You could also try something like:

"<a[^>]+href=\"([^\"]*)\"[^>]*>(.*?)</a>"

Which should give you the URL in $1 and the link text in $2, but only if the
URL is correctly quoted, as it would be in XML. Badly written HTML (an
unquoted href value, for example) will not match; to handle that you have to
get more complex still. There are some more examples like this here:
http://regexlib.com/Search.aspx?k=link
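
For what it's worth, a sketch of pulling out both sub-expressions. It uses std::regex rather than boost::regex (an assumption, so it builds without Boost), and the pattern is written with [^>]* after the href value, which is my reading of the intent. The helper name extract_links is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: return (URL, link text) pairs for each anchor found.
// Only double-quoted href values are handled; this is a sketch, not an HTML parser.
std::vector<std::pair<std::string, std::string>>
extract_links(const std::string& html) {
    // [^>]* after the closing quote skips any further attributes (assumed intent).
    static const std::regex e("<a[^>]+href=\"([^\"]*)\"[^>]*>(.*?)</a>",
                              std::regex::ECMAScript | std::regex::icase);
    std::vector<std::pair<std::string, std::string>> links;
    for (std::sregex_iterator i(html.begin(), html.end(), e), j; i != j; ++i) {
        const std::smatch& m = *i;
        assert(m.size() == 3);  // whole match plus two marked sub-expressions
        links.emplace_back(m[1].str(), m[2].str());  // $1 = URL, $2 = link text
    }
    return links;
}
```

Note that . does not match a newline in this grammar, so link text that spans several lines would be missed by this sketch.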

HTH, John.


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net