Boost Users :

From: John Maddock (john_at_[hidden])
Date: 2007-11-26 11:50:31


hallouina-ml_at_[hidden] wrote:
> Hello;
>
> I am trying to extract URLs from a web page. It is almost done, but
> completely unoptimised.
>
> Before that I tried a regex iterator, but I don't understand the
> documentation.

:-(

Did you see this example:
http://www.boost.org/libs/regex/example/snippets/regex_token_iterator_eg_2.cpp

It does exactly what you want: it extracts all the URLs from an HTML file.
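
For readers who cannot open the link, here is a minimal sketch of the token-iterator approach. It is not the linked example itself. As an assumption, it uses std::regex, whose interface closely mirrors Boost.Regex, so that it builds without Boost; the helper name extract_links and the exact pattern are illustrative only.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost): collect the href target of
// every <a ...> tag found in `html`.
std::vector<std::string> extract_links(const std::string& html) {
    // Group 1 is [^"]*, so the capture cannot run past the closing quote.
    static const std::regex link_re(
        R"re(<\s*a\s+[^>]*href\s*=\s*"([^"]*)")re", std::regex::icase);

    std::vector<std::string> links;
    // The trailing argument 1 makes the iterator yield sub-expression 1
    // of each match instead of the whole match.
    const std::sregex_token_iterator end;
    for (std::sregex_token_iterator it(html.begin(), html.end(), link_re, 1);
         it != end; ++it) {
        links.push_back(it->str());
    }
    return links;
}
```

Calling extract_links on the contents of a page returns one string per link, without the surrounding quotes. With Boost the same shape should apply using boost::regex and boost::sregex_token_iterator.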

> boost::regex rexp(".*(http:\\/\\/.+)\"*.*");
>
>
> and I get this result :
>
> http://www.nolife-tv.com/"
> http://www.nolife-tv.com">
> http://www.nolife-tv.com/images/stories/noiz/1.jpg"
> http://www.nolife-tv.com/component/option,com_poll/task,results/id,16/Itemid,47/';"
> http://www.joomla.org"
> http://www.google-analytics.com/urchin.js"
> http://www.omniture.com
>
> and so on...
>
> I want to cut this down and get only the URL, without the " or '.
> Why does this regex capture the " as well? I put the closing bracket before
> the ", so why? I have already tried \\" rather than \".

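Because the .* at the end of the expression will match whatever text follows
the ". In addition, the .+ inside the group is greedy and \"* is allowed to
match zero quotes, so the group itself runs on past the closing quote. The
grouping construct (...) produces a *sub-expression*, which you can access
via the match_results::operator[] or match_results::str(i) methods.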
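
A minimal sketch of reading a sub-expression, and of one way to stop the capture at the quote. As an assumption it uses std::regex and std::smatch as stand-ins for boost::regex and boost::smatch. The helper first_group and the two pattern constants are hypothetical names, and the / escapes from the original pattern are dropped because they are not needed.

```cpp
#include <regex>
#include <string>

// The pattern from the question: .+ is greedy, so group 1 swallows the
// closing quote and anything after it on the line.
const std::string kGreedyPattern = R"re(.*(http://.+)"*.*)re";

// One possible fix: exclude quote characters from the capture itself.
const std::string kBoundedPattern = R"re((http://[^"']+))re";

// Hypothetical helper: return sub-expression 1 of the first match of
// `pattern` in `line`, or an empty string when nothing matches.
std::string first_group(const std::string& line, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch what;
    if (!std::regex_search(line, what, re)) {
        return "";
    }
    // what[0] is the whole match; what[1] and what.str(1) both refer to
    // the first parenthesised group.
    return what.str(1);
}
```

Run against a line such as <a href="http://www.joomla.org">, the greedy pattern yields the URL followed by "> while the bounded pattern yields the URL alone.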

HTH, John.


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net