[regex] - Bug in boost::regex ??

Hi, I am trying to extract a pattern from a file. There are actually 2 occurrences of the pattern in the file, one in lines 2-3 and the other in line 4, but the program only reports the occurrence in line 4. The reason I am using (.|\n) together with 'match_not_dot_newline' is that I want the regex to be Perl compatible. Why is the program not reporting the first occurrence? Is this a bug? I am attaching the code and file. Thanks, Kiran.

kiran wrote:
Hi, I am trying to extract a pattern from a file. There are actually 2 occurrences of the pattern in the file, one in lines 2-3 and the other in line 4, but the program only reports the occurrence in line 4.
The reason I am using (.|\n) together with 'match_not_dot_newline' is that I want the regex to be Perl compatible. Why is the program not reporting the first occurrence? Is this a bug? I am attaching the code and file.
It's doing exactly what the regex asks it to do: it matches everything from the first occurrence of "Resurfacing" to the last occurrence of "home". There is only one such match in the document.
John.

Hi,
It is DEFINITELY not doing what it is asked to do. The EXPECTED OUTPUT was:
    Resurfacing for Swimming Pools</title>
    <meta name="robots" content="index,follow">Home
    <meta name="keywords" content="pool Resurfacing,uglassit,fibre-shelkote,Uglassit,Fibre-shelkote,swimming pool resurfacing">Home
But the RESULTANT OUTPUT we got was:
    Resurfacing,uglassit,fibre-shelkote,Uglassit,Fibre-shelkote,swimming pool resurfacing">Home
This means that it is not picking the "Resurfacing" in the SECOND line of the file, but rather the "Resurfacing" in the FOURTH line of the file. Why is the second one not picked? This was my question.
Regards, Kiran.
----- Original Message -----
From: "John Maddock" <john@johnmaddock.co.uk>
To: <boost-users@lists.boost.org>
Sent: Wednesday, August 30, 2006 4:12 PM
Subject: Re: [Boost-users] [regex] - Bug in boost::regex ??
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users

kiran wrote:
Why is the second one not picked ? This was my question.
It is picked for me: I modified your sample program (see below) so that it actually compiled and didn't rely on external files, and I see exactly the output expected: everything from the first "Resurfacing" to the last "home".
    #include "boost/regex.hpp"
    #include <fcntl.h>
    #include <sys/types.h>
    #include <iostream>
    #include <string>
    using namespace boost;
    using namespace std;
    int main()
    {
        char buf[10000];
        //int fd = open("glass.htm", O_RDONLY);
        //int size = read(fd, buf, 10000);
        // The file contents, inlined so the program does not depend on glass.htm.
        string line = "<!-- saved from url=(0022)http://internet.e-mail -->\n"
            "<html><head>\n"
            "<title>UGlassIt Fibre-Shelkote Pool Resurfacing for Swimming Pools</title>\n"
            "<meta name=\"robots\" content=\"index,follow\">Home\n"
            "<meta name=\"keywords\" content=\"pool Resurfacing,uglassit,fibre-shelkote,Uglassit,Fibre-shelkote,swimming pool resurfacing\">Home";
        //close(fd);
        regex expr("Resurfacing(.|\n)*Home", boost::regex::icase | boost::regex::perl);
        try {
            sregex_iterator itr(line.begin(), line.end(), expr, boost::match_not_dot_newline);
            sregex_iterator i;
            // Print each match followed by its starting offset.
            while (itr != i) {
                cout << string((*itr)[0].first, (*itr)[0].second) << " " << (*itr).position(0) << endl;
                ++itr;
            }
        }
        catch (const std::runtime_error& e) {
            cout << e.what() << endl << flush;
        }
    }

Taking a quick look at the docs, the regex you want is: "Resurfacing(.*?)Home"
Just a thought. Seems like quite the thread for a regex pattern. And like John says, it should match from the first Resurfacing to the second Home. If it didn't, I'd be concerned.
The * operator by itself is greedy: it wants to make matches as long as possible. The *? notation makes the repeat non-greedy, i.e. it makes the match as short as possible.
http://www.boost.org/libs/regex/doc/syntax_perl.html
The section under the heading 'Non greedy repeats' pretty much explains things. (Note: this applies to the Perl-style regex; I'm not entirely sure about the other behaviors.)
Cheers, Paul

Hi,
Thanks for answering the question. That shows that this is not a bug in boost::regex. But I have one more thing to ask. Of course, when the dependency on the external file is removed, boost::regex works fine. When you ran the code with the string read directly from the file, did your regex pick the second one? You may well think this is not a question for a busy person like you to answer, and you may choose to ignore it. But if you can, please answer. Also, please tell me which version of the Boost library you are using. I am running the code on a Linux machine, and the code I already sent is not picking the second one.
Thanks, Kiran.
----- Original Message -----
From: "John Maddock" <john@johnmaddock.co.uk>
To: <boost-users@lists.boost.org>
Sent: Wednesday, August 30, 2006 8:53 PM
Subject: Re: [Boost-users] [regex] - Bug in boost::regex ??

kiran wrote:
Hi, Thanks for answering the question. That shows that this is not a bug in boost::regex. But I have one more thing to ask. Of course, when the dependency on the external file is removed, boost::regex works fine. When you ran the code with the string read directly from the file, did your regex pick the second one? You may well think this is not a question for a busy person like you to answer, and you may choose to ignore it. But if you can, please answer. Also, please tell me which version of the Boost library you are using. I am running the code on a Linux machine, and the code I already sent is not picking the second one.
I didn't try loading from the file: not enough time for that, sorry. In any case a quick check in the debugger, or even a cout << the_string; would quickly tell you what's getting loaded. I'm using what will become Boost-1.34, but there shouldn't be any differences from previous versions, although I would recommend using Boost-1.33.1 if you can.
John.
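The check John suggests could look roughly like the sketch below. The thread does not show the original file-reading code in full, so reading through std::ifstream is an assumption, and the names load_file and show_control_chars are hypothetical. The second helper makes line endings and stray NUL bytes visible, which a plain cout << the_string would not.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: read a whole file into a std::string, byte for byte.
std::string load_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}

// Hypothetical helper: return a copy of `text` in which carriage returns,
// newlines and NUL bytes are spelled out, so the loaded data can be inspected.
std::string show_control_chars(const std::string& text)
{
    std::string out;
    for (unsigned char c : text) {
        if (c == '\n')
            out += "\\n\n";   // keep the real line break after the marker
        else if (c == '\r')
            out += "\\r";
        else if (c == '\0')
            out += "\\0";
        else
            out += static_cast<char>(c);
    }
    return out;
}
```

Printing show_control_chars(load_file("glass.htm")) before running the regex shows exactly which bytes the iterator will see, for example whether the file uses \r\n line endings or whether the string is shorter or longer than expected.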
participants (3)
- John Maddock
- kiran
- Paul Davis