Boost Users :
From: hallouina-ml_at_[hidden]
Date: 2007-11-26 11:22:55
Hello,
I am trying to extract the URLs from a web page. It almost works, but it is completely unoptimised.
At first I tried a regex iterator, but I don't understand the documentation. I spent too much time on that approach, so I tried another way:
I fetch the web page with libcurl,
then I replace every " " with a "\n",
like this:
string::size_type i = 0;
while (( i = page_a_analyser.find(' ', i ) ) != (string::npos))
{
    page_a_analyser.replace(i++, 1, "\n" );
}
Then I apply this regex:
boost::regex rexp(".*(http:\\/\\/.+)\"*.*");
and I get this result:
http://www.nolife-tv.com/"
http://www.nolife-tv.com">
http://www.nolife-tv.com/images/stories/noiz/1.jpg"
http://www.nolife-tv.com/component/option,com_poll/task,results/id,16/Itemid,47/';"
http://www.joomla.org"
http://www.google-analytics.com/urchin.js"
http://www.omniture.com
and so on...
I want to cut these down and keep only the URL, without the " or '.
Why does this regex capture the " as well? I put the closing parenthesis before the ", so why? I already tried \\" instead of \".
I also tried (\"|')" to say " or ', but that doesn't work either...
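One likely explanation for the trailing quote: `.+` inside the group is greedy and runs to the end of the line, while `\"*` is allowed to match zero quotes, so nothing forces the group to stop before the ". A minimal sketch of one possible fix, assuming a URL never contains whitespace, quotes, or angle brackets, is to replace `.+` with a negated character class. `std::regex` is used here only so the sketch compiles without Boost (`boost::regex` and `boost::regex_search` should accept the same pattern); the helper name `extract_url` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first http:// URL found in `token`,
// or an empty string if there is none.
std::string extract_url(const std::string& token)
{
    // [^\s"'<>]+ stops at the first whitespace, quote, or angle bracket,
    // so the closing " or ' is never part of the match.
    // Assumption: URLs on the page contain none of those characters.
    static const std::regex url_re(R"re(http://[^\s"'<>]+)re");

    std::smatch m;
    if (std::regex_search(token, m, url_re))
        return m[0].str();
    return std::string();
}
```

Because the match ends on its own at the delimiter, the quotes do not need to be replaced beforehand.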
So I did it another way:
I fetch the web page with libcurl,
then I replace every " " with "\n",
then replace every " with \n,
then replace every ' with \n,
then I apply the regex.
And I have to do the replacement with 3 while loops rather than just one, because putting the 3 conditions in a single while didn't work:
while ( (( i = page_a_analyser.find(' ', i ) ) != (string::npos)) or ( i = page_a_analyser.find('"', i ) ) != (string::npos) or ( i = page_a_analyser.find('\'', i ) ) != (string::npos) )
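The combined condition fails because `or` short-circuits: as long as a space is found, the other two searches never run, and once the first `find` returns `string::npos`, `i` holds `npos`, so the remaining searches start from there and find nothing. A minimal sketch of a single loop that handles all three characters, assuming `std::string::find_first_of` is acceptable here (the helper name `split_on_delimiters` is hypothetical):

```cpp
#include <string>

// Hypothetical helper: replaces every space, double quote, and single quote
// in `page` with a newline, in a single pass.
void split_on_delimiters(std::string& page)
{
    std::string::size_type i = 0;
    // find_first_of returns the position of the next character that is
    // any one of the characters in the given set, or npos if none is left.
    while ((i = page.find_first_of(" \"'", i)) != std::string::npos)
    {
        page[i] = '\n';  // overwrite the delimiter in place
        ++i;             // continue searching after it
    }
}
```

Calling `split_on_delimiters(page_a_analyser)` would replace the three separate while loops.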
So how can I improve just the regex to extract the URL? I would like to do something as simple as:
replace " " with "\n",
then apply the right regex.
I don't want to use a regex iterator again. The regex iterator has worn out my patience... 3 days on it is enough for me.
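For reference, collecting every URL does not strictly require a regex iterator or any prior replacement. A minimal sketch, assuming the same "no whitespace, quotes, or angle brackets in a URL" rule, is to call `regex_search` repeatedly and restart after each match. `std::regex` is used so the sketch compiles without Boost; the helper name `extract_all_urls` is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every http:// URL found in `page`.
std::vector<std::string> extract_all_urls(const std::string& page)
{
    // Assumption: a URL ends at the first whitespace, quote, or angle bracket.
    static const std::regex url_re(R"re(http://[^\s"'<>]+)re");

    std::vector<std::string> urls;
    std::string::const_iterator first = page.begin();
    const std::string::const_iterator last = page.end();
    std::smatch m;

    // Search the remaining text, then restart just past the previous match.
    while (std::regex_search(first, last, m, url_re))
    {
        urls.push_back(m[0].str());
        first = m[0].second;
    }
    return urls;
}
```

The loop always advances because every match is at least as long as `http://` plus one character.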
Thanks for your attention