Boost Users :

From: Andrew McDonald (andrew.mcdonald_at_[hidden])
Date: 2006-05-18 00:27:47


Hi,

I suspect I may be about to be hideously embarrassed, but for the life of me I cannot see what I have done wrong, so please be kind if this turns out to be my fault ;)

A few days ago I was testing a simple regex-based log parser I had written when it failed to correctly extract some of the log data.

The log files are typically large, so each file is read in blocks and match_partial is used to cope with data that is split across a block boundary.
Some poking about revealed that the data in question had been split across a block boundary and regex_search had returned no match rather than a partial match. As a result, the initial section of the data was discarded and so was never matched once the next block arrived.
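
To make the failure mode concrete, here is a minimal sketch of the kind of block-handling loop described above. It is not my actual parser: Status, Found, Searcher and feed_block are hypothetical names, and the regex call is hidden behind a callback so the buffer logic can be read on its own. In the real program I would expect the callback to wrap regex_search with match_partial, reporting Full when match[0].matched is true and Partial when it is false.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Outcome of one search over the buffer (hypothetical names throughout).
enum class Status { None, Partial, Full };

struct Found {
    Status status;
    std::size_t begin; // offset where the full or partial match starts
    std::size_t end;   // one past the end of a full match; unused otherwise
};

// Assumption: in the real parser this wraps regex_search(..., match_partial).
using Searcher = std::function<Found(const std::string&, std::size_t)>;

// Append one block to the carried-over tail, collect every full match, and
// leave in `carry` only the text that might still complete in the next block.
void feed_block(std::string& carry, const std::string& block,
                const Searcher& search, std::vector<std::string>& out)
{
    carry += block;
    std::size_t pos = 0;
    while (pos < carry.size()) {
        const Found f = search(carry, pos);
        if (f.status == Status::Partial) {
            // Keep the start of the partial match for the next block.
            carry.erase(0, f.begin);
            return;
        }
        if (f.status == Status::None)
            break; // nothing pending: everything left is discarded below
        assert(pos <= f.begin && f.begin <= f.end && f.end <= carry.size());
        out.push_back(carry.substr(f.begin, f.end - f.begin));
        // Always advance, even if the searcher reports an empty match.
        pos = (f.end > pos) ? f.end : pos + 1;
    }
    carry.clear();
}
```

The None branch is where the data goes missing: if the search reports no match for text that is really the start of a match, the buffer is cleared and the remainder in the next block has nothing to attach to.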

I have since spent several hours refining the original regex and data down to the most minimal form that still reproduces the behaviour.
Sadly I have run out of time for now so have not yet had a chance to snoop around with the debugger
(plus I am not yet familiar with the regex implementation details).
I thought I would post anyway in case this was a known problem or an obvious fubar on my part.

I am using boost_1_33_1, in Microsoft Visual C++ 2005 Express Edition version 8.0.50727.42.
Boost has been built with the vc-8_0 toolset.

The regex in question contains non-greedy repeats and the Boost.Regex history page
http://www.boost.org/libs/regex/doc/history.html
does mention several bug fixes in recent revisions for both non-greedy repeats and partial matches.
I am using the latest version mentioned (1.33.1), so I am assuming these are not the cause.

Additionally the caveats mentioned on
http://www.boost.org/libs/regex/doc/partial_matches.html
regarding expressions that always produce partial matches, and expressions that preferentially produce partial matches to full matches, do not seem to apply.

A search of the GMane mailing list archive did not turn up any relevant posts either.

The regex (default syntax) and test data are:

const regex TEST_REGEX("A[^B]*?B.*?C");
const char TEST_DATA[] = "AxBxC";

The program (included below) matches the test string against the regex, removes the last character from the test string and repeats.

It produces the following output.

Test: AxBxC
Result: Full Match
Test: AxBx
Result: No Match !!!!
Test: AxB
Result: Partial Match
Test: Ax
Result: Partial Match
Test: A
Result: Partial Match

Unless I am completely crazy, logic would suggest that it should be impossible to have no match for a string that is one character less than a full match, especially as we then have partial matches for even less input.
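
To check that reasoning independently of the library, here is a small brute-force reference for this one pattern. It is a sketch, not Boost.Regex's algorithm: the pattern A[^B]*?B.*?C is hand-coded as five pieces, and I am assuming that "partial match" means the input ran out while the rest of the pattern could still have matched. Outcome, Kind, Piece, match_from and classify are names invented for the illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

enum class Outcome { None = 0, Partial = 1, Full = 2 };

enum class Kind { Literal, StarExcept, StarAny };
struct Piece { Kind kind; char ch; };

// Hand-coded form of A[^B]*?B.*?C (greediness does not affect the outcome
// class, only which full match would be chosen).
const Piece PATTERN[] = {
    {Kind::Literal, 'A'}, {Kind::StarExcept, 'B'}, {Kind::Literal, 'B'},
    {Kind::StarAny, '\0'}, {Kind::Literal, 'C'},
};
const std::size_t PATTERN_SIZE = sizeof(PATTERN) / sizeof(PATTERN[0]);

// Try to match PATTERN[piece..] against s[pos..], exploring every split.
Outcome match_from(const std::string& s, std::size_t pos, std::size_t piece)
{
    if (piece == PATTERN_SIZE)
        return Outcome::Full; // whole pattern consumed
    const Piece& p = PATTERN[piece];
    if (p.kind == Kind::Literal) {
        if (pos == s.size())
            return Outcome::Partial; // input ended mid-pattern
        return s[pos] == p.ch ? match_from(s, pos + 1, piece + 1)
                              : Outcome::None;
    }
    // Repeat: first try zero further repetitions, then consume one character.
    Outcome best = match_from(s, pos, piece + 1);
    if (best != Outcome::Full && pos < s.size() &&
        (p.kind == Kind::StarAny || s[pos] != p.ch))
        best = std::max(best, match_from(s, pos + 1, piece));
    return best;
}

// Search semantics: try every start position, preferring Full over Partial.
Outcome classify(const std::string& s)
{
    Outcome best = Outcome::None;
    for (std::size_t start = 0; start < s.size(); ++start)
        best = std::max(best, match_from(s, start, 0));
    return best;
}
```

Under that reading of partial matching, classify gives Full for "AxBxC" and Partial for each of "AxBx", "AxB", "Ax" and "A", which is the output I expected from the test program.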

FYI, I cannot seem to reduce the regex further without this behaviour disappearing.

All the following (slight) modifications to the problematic regex seem to make the behaviour disappear.

// only first non-greedy repeat
"A[^B]*?B"

// altered first non-greedy repeat
"A.*?B.*?C"

// removed initial fixed char
"[^B]*?B.*?C"

// make first repeat greedy
"A[^B]*B.*?C"

// make second repeat greedy
"A[^B]*?B.*C"

Test program
=============================================

#include <boost/regex.hpp>
#include <string>
#include <iostream>

using namespace boost;
using namespace std;

namespace boost
{
   void assertion_failed(char const * expr, char const * function, char const * file, long line)
   {
         cout << "Boost assertion failure: ("
                   << expr << ") in " << function << ", "
                   << file << "(" << line << ")" << endl
           ;
   }
}

const regex TEST_REGEX("A[^B]*?B.*?C");
const char TEST_DATA[] = "AxBxC";

const char* END = TEST_DATA + (sizeof(TEST_DATA) - 1);

// output test data and type of match
void test(const string& data)
{
   cout << "Test: " << data << endl;
   cout << "Result: ";

   smatch match;

   if ( !regex_search(data, match, TEST_REGEX, match_partial) )
   {
      cout << "No Match !!!!" << endl;
      return;
   }

   // regex_search succeeded; match[0].matched is false for a partial match
   if (!match[0].matched)
   {
      cout << "Partial Match" << endl;
      return;
   }

   cout << "Full Match" << endl;
}

int main(int argc, char * argv[])
{
   // start with the full match and
   // iteratively remove last character
   // and check regex matches
   const char* start = TEST_DATA;
   for(const char* end = END; end != start; --end)
   {
      // convert to string just for ease of output
      test(string(start, end));
   }
   return 0;
}

=============================================

regards,

Andrew McDonald
System Architect


