|
Boost Users : |
From: Detlef Meyer-Eltz (Meyer-Eltz_at_[hidden])
Date: 2006-12-18 14:56:56
Thank you for your quick reply. It's just because of the
leftmost-longest rule, that I'm indeed using POSIX extended style
(from boost 1.33.1 with Borland CBuilder 6). To assert, that I didn't
do anything wrong im my software, I now have made an extra commandline
program:
///////////////////////////////////////////
#pragma hdrstop
#include <iostream>
#include <string>
#include "boost\regex.hpp"
using namespace std;
using namespace boost;
//---------------------------------------------------------------------------
int main(int argc, char* argv[])
{
// my usual option, but makes no difference here, I think
regex_constants::syntax_option_type syntax_flags = regex_constants::extended
& ~regex_constants::no_escape_in_lists;
regex expression1("(<[^>]*>)|(<body([^>]*)>)", regex_constants::extended);
regex expression2("(<body([^>]*)>)|(<[^>]*>)", regex_constants::extended);
string testtext = "<body bgcolor=\"white\">";
std::string::const_iterator start, end;
start = testtext.begin();
end = testtext.end();
match_results<std::string::const_iterator> what;
match_flag_type flags = match_default;
cout << "testing: " << expression1 << endl;
if(regex_search(start, end, what, expression1, flags))
{
int iCount = 0;
for(unsigned int i = 0 ; i < what.size() ; ++i )
cout << "sub-expression" << iCount++ << ": \"" << what[i] << "\"" << endl;
}
cout << endl;
cout << "testing: " << expression2 << endl;
if(regex_search(start, end, what, expression2, flags))
{
int iCount = 0;
for(unsigned int i = 0 ; i < what.size() ; ++i )
cout << "sub-expression" << iCount++ << ": \"" << what[i] << "\"" << endl;
}
return 0;
}
/////////////////////////////////////////////////////////
That's the result:
testing: (<[^>]*>)|(<body([^>]*)>)
sub-expression0: "<body bgcolor="white">" // (<[^>]*>)|(<body([^>]*)>)
sub-expression1: "<body bgcolor="white">" // <[^>]*>
sub-expression2: "" // <body([^>]*)>
sub-expression3: "" // [^>]*
testing: (<body([^>]*)>)|(<[^>]*>)
sub-expression0: "<body bgcolor="white">" // (<body([^>]*)>)|(<[^>]*>)
sub-expression1: "<body bgcolor="white">" // <body([^>]*)>
sub-expression2: " bgcolor="white"" // [^>]*
sub-expression3: "" // <[^>]*>
I hoped, for both arrangements of the alternatives, the matching parts
would be the same. I would be very happy, if I had made a mistake ;-)
Best regards
Detlef Meyer-Eltz
-- am Montag, 18. Dezember 2006 um 18:31 schrieben Sie: > Detlef Meyer-Eltz wrote: >> I have a difficulty to predict, which part of a regular expression >> will match. >> >> Example: >> I have a regular expression for a general HTML tag: <[^>]*> >> combined with an expression for the body tag: <body([^>]*)> >> >> to: (<[^>]*>)|(<body([^>]*)>) >> >> This expression matches the text: <body bgcolor="white"> >> >> As both alternatives can match the input with the same length, I >> expected, that the repeated fouth part of the "Leftmost Longest" Rule >> would determine, which alternatve is chosen: >> >> 4. Find the match which has matched the first sub-expression in the >> leftmost position, along with any ties. If there is only on(e) >> such match possible then return it. >> >> // note the missing 'e' >> >> As the tag-expression has no sub-expression at all, the >> body-expression should win. Its sub-expression could match, but >> doesn't. It seems to me, that the sequence of the alternatives >> determines the match. >> >> Now I guess, that I misinterpreted 4.: its not a means to predict the >> matching alternative but only to find the one that matched >> accidentally? My software constructs lexers from elementary >> expressions automatically. So it's important for me to direct and >> predict the expected matching alternative. Are there any other rules? >> Does the sequence of the alternatives determine the match >> unmistakably? > Which Boost.Regex version are you using, and how are you compiling the > expression? > Recent versions default to the Perl matching rules: *which do not use the > leftmost longest rule*. They match based on a "first match found" rule, so > if the first alternative leads to a match then subsequent alternatives are > never examined. > If you really want leftmost-longest semantics, then compile the expression > as a POSIX extended regex, but of course then you loose the ability to use > Perl-like regex extensions. > HTH, John. > PS, your analysis of the leftmost-longest rule looks correct however. > _______________________________________________ > Boost-users mailing list > Boost-users_at_[hidden] > http://lists.boost.org/mailman/listinfo.cgi/boost-users
Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net