Boost logo

Boost Users :

From: Detlef Meyer-Eltz (Meyer-Eltz_at_[hidden])
Date: 2006-12-18 14:56:56


Thank you for your quick reply. It's just because of the
leftmost-longest rule, that I'm indeed using POSIX extended style
(from boost 1.33.1 with Borland CBuilder 6). To assert, that I didn't
do anything wrong im my software, I now have made an extra commandline
program:

///////////////////////////////////////////

#pragma hdrstop

#include <iostream>
#include <string>
#include "boost\regex.hpp"

using namespace std;
using namespace boost;
//---------------------------------------------------------------------------

int main(int argc, char* argv[])
{
  // my usual option, but makes no difference here, I think
  regex_constants::syntax_option_type syntax_flags = regex_constants::extended
                                                      & ~regex_constants::no_escape_in_lists;

  regex expression1("(<[^>]*>)|(<body([^>]*)>)", regex_constants::extended);
  regex expression2("(<body([^>]*)>)|(<[^>]*>)", regex_constants::extended);
  string testtext = "<body bgcolor=\"white\">";
  std::string::const_iterator start, end;
  start = testtext.begin();
  end = testtext.end();

  match_results<std::string::const_iterator> what;
  match_flag_type flags = match_default;

  cout << "testing: " << expression1 << endl;
  if(regex_search(start, end, what, expression1, flags))
  {
    int iCount = 0;
    for(unsigned int i = 0 ; i < what.size() ; ++i )
      cout << "sub-expression" << iCount++ << ": \"" << what[i] << "\"" << endl;
  }

  cout << endl;

  cout << "testing: " << expression2 << endl;
  if(regex_search(start, end, what, expression2, flags))
  {
    int iCount = 0;
    for(unsigned int i = 0 ; i < what.size() ; ++i )
      cout << "sub-expression" << iCount++ << ": \"" << what[i] << "\"" << endl;
  }

  return 0;
}

/////////////////////////////////////////////////////////

That's the result:

testing: (<[^>]*>)|(<body([^>]*)>)
sub-expression0: "<body bgcolor="white">" // (<[^>]*>)|(<body([^>]*)>)
sub-expression1: "<body bgcolor="white">" // <[^>]*>
sub-expression2: "" // <body([^>]*)>
sub-expression3: "" // [^>]*

testing: (<body([^>]*)>)|(<[^>]*>)
sub-expression0: "<body bgcolor="white">" // (<body([^>]*)>)|(<[^>]*>)
sub-expression1: "<body bgcolor="white">" // <body([^>]*)>
sub-expression2: " bgcolor="white"" // [^>]*
sub-expression3: "" // <[^>]*>

I hoped, for both arrangements of the alternatives, the matching parts
would be the same. I would be very happy, if I had made a mistake ;-)

Best regards

Detlef Meyer-Eltz

--
am Montag, 18. Dezember 2006 um 18:31 schrieben Sie:
> Detlef Meyer-Eltz wrote:
>> I have a difficulty to predict, which part of a regular expression
>> will match.
>>
>> Example:
>> I have a regular expression for a general HTML tag: <[^>]*>
>> combined with an expression for the body tag: <body([^>]*)>
>>
>> to: (<[^>]*>)|(<body([^>]*)>)
>>
>> This expression matches the text: <body bgcolor="white">
>>
>> As both alternatives can match the input with the same length, I
>> expected, that the repeated fouth part of the "Leftmost Longest" Rule
>> would determine, which alternatve is chosen:
>>
>> 4. Find the match which has matched the first sub-expression in the
>>   leftmost position, along with any ties.  If there is only on(e)
>>   such match possible then return it.
>>
>> // note the missing 'e'
>>
>> As the tag-expression has no sub-expression at all, the
>> body-expression should win. Its sub-expression could match, but
>> doesn't. It seems to me, that the sequence of the alternatives
>> determines the match.
>>
>> Now I guess, that I misinterpreted 4.: its not a means to predict the
>> matching alternative but only to find the one that matched
>> accidentally? My software constructs lexers from elementary
>> expressions automatically. So it's important for me to direct and
>> predict the expected matching alternative. Are there any other rules?
>> Does the sequence of the alternatives determine the match
>> unmistakably?
> Which Boost.Regex version are you using, and how are you compiling the 
> expression?
> Recent versions default to the Perl matching rules: *which do not use the 
> leftmost longest rule*.  They match based on a "first match found" rule, so 
> if the first alternative leads to a match then subsequent alternatives are 
> never examined.
> If you really want leftmost-longest semantics, then compile the expression 
> as a POSIX extended regex, but of course then you loose the ability to use 
> Perl-like regex extensions.
> HTH, John.
> PS, your analysis of the leftmost-longest rule looks correct however.
> _______________________________________________
> Boost-users mailing list
> Boost-users_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/boost-users

Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net