
Boost Users :

From: Christopher Hart (hartct_at_[hidden])
Date: 2006-05-27 17:07:46


All:

I think I'm seeing strange behavior with the greedy and non-greedy
quantifiers in Boost.Regex (1.33.1, Mac OS X 10.4.6), and was hoping
someone could confirm whether I'm using them correctly. For example,
the following two calls:

find_matches(".*a href=(.*?)>",
   "<html><head><title>test</title></head><body>this is a test<br/><a href=testlink.html><br/>more text</body></html>");
find_matches(".*a href=(.*)>",
   "<html><head><title>test</title></head><body>this is a test<br/><a href=testlink.html><br/>more text</body></html>");

produce the same sub-expression matches:

Expression: ".*a href=(.*?)>"
Text: "<html><head><title>test</title></head><body>this is a test<br/><a href=testlink.html><br/>more text</body></html>"
** Match found **
   Sub-Expressions:
      $0 = "<html><head><title>test</title></head><body>this is a test<br/><a href=testlink.html><br/>more text</body></html>"
      $1 = "testlink.html><br/>more text</body></html"
   Captures:
      $0 = { "<html><head><title>test</title></head><body>this is a test<br/><a href=testlink.html><br/>more text</body></html>" }
      $1 = { "testlink.html><br/>more text</body></html" }
Expression: ".*a href=(.*)>"
Text: "<html><head><title>test</title></head><body>this is a test<br/><a href=testlink.html><br/>more text</body></html>"
** Match found **
   Sub-Expressions:
      $0 = "<html><head><title>test</title></head><body>this is a test<br/><a href=testlink.html><br/>more text</body></html>"
      $1 = "testlink.html><br/>more text</body></html"
   Captures:
      $0 = { "<html><head><title>test</title></head><body>this is a test<br/><a href=testlink.html><br/>more text</body></html>" }
      $1 = { "testlink.html><br/>more text</body></html" }

I would expect the (.*?) sub-expression to capture only "testlink.html",
since a non-greedy quantifier should consume as little as possible and
the first ">" that follows is enough to complete the pattern. (This same
expression is working as written in Perl.)

The find_matches function looks like:

void find_matches(const std::string& regx, const std::string& text)
{
   boost::regex e(regx);
   boost::smatch what;
   std::cout << "Expression: \"" << regx << "\"\n";
   std::cout << "Text: \"" << text << "\"\n";
   if(boost::regex_match(text, what, e, boost::match_extra | boost::match_partial))
   {
      unsigned i, j;
      std::cout << "** Match found **\n Sub-Expressions:\n";
      for(i = 0; i < what.size(); ++i)
         std::cout << " $" << i << " = \"" << what[i] << "\"\n";
      std::cout << " Captures:\n";
      for(i = 0; i < what.size(); ++i)
      {
         std::cout << " $" << i << " = {";
         for(j = 0; j < what.captures(i).size(); ++j)
         {
            if(j)
               std::cout << ", ";
            else
               std::cout << " ";
            std::cout << "\"" << what.captures(i)[j] << "\"";
         }
         std::cout << " }\n";
      }
   }
   else
   {
      std::cout << "** No Match found **\n";
   }
}
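
For comparison, here is a minimal sketch that contrasts whole-string
matching with searching. It uses std::regex as a stand-in for Boost.Regex,
on the assumption that the two behave alike here, and it leaves out the
match_extra and match_partial flags. The helper names capture_whole,
capture_search and compare_greediness are hypothetical and exist only for
this illustration. A whole-string match has to account for every character
of the text, so even a non-greedy (.*?) must stretch to the final ">",
which may explain the identical output above.

```cpp
#include <iostream>
#include <regex>
#include <string>

// Hypothetical helper: returns $1 when the pattern must match the
// ENTIRE text (regex_match), or an empty string if there is no match.
std::string capture_whole(const std::string& pattern, const std::string& text)
{
   std::regex e(pattern);
   std::smatch what;
   return std::regex_match(text, what, e) ? what[1].str() : std::string();
}

// Hypothetical helper: returns $1 of the first match found anywhere
// in the text (regex_search), or an empty string if there is no match.
std::string capture_search(const std::string& pattern, const std::string& text)
{
   std::regex e(pattern);
   std::smatch what;
   return std::regex_search(text, what, e) ? what[1].str() : std::string();
}

// Prints $1 for the non-greedy and greedy expressions under both calls.
void compare_greediness()
{
   const std::string text =
      "<html><head><title>test</title></head><body>this is a test<br/>"
      "<a href=testlink.html><br/>more text</body></html>";
   const std::string patterns[] = { ".*a href=(.*?)>", ".*a href=(.*)>" };

   for(const std::string& p : patterns)
   {
      std::cout << "Expression: \"" << p << "\"\n";
      std::cout << "   regex_match  $1 = \"" << capture_whole(p, text) << "\"\n";
      std::cout << "   regex_search $1 = \"" << capture_search(p, text) << "\"\n";
   }
}
```

Calling compare_greediness() from main should show the two expressions
giving the same $1 under regex_match, while under regex_search the
non-greedy form captures only "testlink.html".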

Am I missing something in the usage, or is this a bug? Any guidance
is appreciated.

Thanks,
Chris Hart

