Subject: Re: [Boost-users] Boost Regex Back Reference Issue
From: Mark Stallard (stallard_at_[hidden])
Date: 2017-07-13 17:51:10


Nick wrote:

> According to the Boost documentation for Back references [...]

I use regexes frequently in Perl, C++, and other languages. In C++ I use Boost Regex exclusively, even though I probably should be using std::regex instead. I'm much happier using Boost Regex and std::string than I was with PCRE and char*.

I only read the Boost regex docs for the API, not for regex syntax.

> For example the expression:
>
> ^(a*).*\1$
>
> Will match the string:
>
> aaabbaaa
>
> But not the string:
>
> aaabba

This example doesn't make sense to me either, for the same reasons that you mentioned: both strings begin and end with "a", and a* can match zero or more characters.

I tested this example on Perl 5.22, adding the string "unmatchable" to the test input:

    $ perl -nle 'print "$_ ($1)" if /^(a*).*\1$/' <<HERE
    > aaabbaaa
    > aaabba
    > unmatchable
    > HERE

The output shows both the input string and the submatch captured by $1 and indicates that the pattern will match any input string:

    aaabbaaa (aaa)
    aaabba (a)
    unmatchable ()
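
For anyone who would rather repeat the check from C++, here is a minimal sketch. It assumes std::regex with its default ECMAScript grammar, not Boost Regex, so it only shows how one backtracking engine treats the pattern and says nothing about Boost's behavior. The helper name doc_example_matches is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of any library): tests the documentation's
// pattern against `input` and, on a match, stores the text captured by
// group 1 in `group1`.
bool doc_example_matches(const std::string& input, std::string& group1) {
    // Raw string literal so the backslash in \1 needs no extra escaping.
    static const std::regex pattern(R"(^(a*).*\1$)");
    std::smatch m;
    if (!std::regex_search(input, m, pattern)) {
        return false;
    }
    group1 = m[1].str();  // may be empty, because a* can match zero characters
    return true;
}
```

Under these assumptions, calling it on "aaabbaaa", "aaabba", and "unmatchable" should report a match each time, with group 1 holding "aaa", "a", and the empty string respectively, the same as the Perl output above.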

I haven't tested this with Boost Regex or any other C++ library, but I believe that this example should be edited or replaced. I would suggest changing (a*) to (a+) and "aaabba" to "aaabbb". Alternatively, you could change (a*) to (a{2}) and keep the original input strings, but (a+) makes a simpler example.
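
A quick way to sanity-check the two suggested replacements is sketched below. Again this assumes std::regex (ECMAScript grammar) rather than Boost Regex, and the names kPlusPattern, kExactlyTwoPattern, and pattern_matches are hypothetical.

```cpp
#include <regex>
#include <string>

// The two replacement patterns suggested above, written as raw string
// literals so that \1 does not need a doubled backslash.
const char* const kPlusPattern = R"(^(a+).*\1$)";
const char* const kExactlyTwoPattern = R"(^(a{2}).*\1$)";

// Hypothetical helper: true if `pattern` matches `input`. Both patterns
// are anchored with ^ and $, so regex_search acts as a whole-string test.
bool pattern_matches(const std::string& input, const std::string& pattern) {
    const std::regex re(pattern);
    return std::regex_search(input, re);
}
```

With (a+), "aaabbaaa" should match and "aaabbb" should not, because group 1 must now contain at least one "a" and the string ends in "b". With (a{2}), "aaabbaaa" should match and the original "aaabba" should not, because the string would have to end in "aa".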

> 2. The other issue is that when I try this example with the string which is posted to not match, the Boost
> regex engine runs for a while and ultimately crashes with a memory error. (seems like it might be an endless
> loop of some sort). Is that a bug?

Ouch! I'm glad nothing like that has happened to me.

It's generally a bad idea to use * too frequently in a regex, especially .* . Adding a back reference to a* only aggravates the issue. I had an experience years ago where too many .*'s killed the performance of a Perl script, and probably consumed way too much memory.

|+| M a r k |+|

Mark Stallard
Engineering & Operations Application Development
Business Application Services
Global Business Services Information Technology

Raytheon Company
Billerica, MA (US)


-----Original Message-----
From: Boost-users [mailto:boost-users-bounces_at_[hidden]] On Behalf Of Nick via Boost-users
Sent: Thursday, July 13, 2017 10:04 AM
To: boost-users_at_[hidden]
Cc: Nick <nospam_at_[hidden]>
Subject: [Boost-users] Boost Regex Back Reference Issue

I have a question about the Boost documentation for Back references
(http://www.boost.org/doc/libs/1_64_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.back_references), which seems to be the same at least from 1.37 through 1.64.

Quoting a piece of the documentation:

-----

For example the expression:

   ^(a*).*\1$

Will match the string:

   aaabbaaa

But not the string:

   aaabba

-----

I'm finding two issues with the example cited.

1. It seems to me that the example of the string which should not match, should actually match. Ultimately, shouldn't the engine match the marked sub-expression with the first 'a' in order to satisfy the backreference?
I tested this example with Oniguruma and with PHP's PCRE and they both matched the string noted here to not match.

But also, since the marked sub-expression is (a*) then I wonder what the behavior would be if it couldn't make a match on 'a', since the '*' will allow for zero matches. In fact, it seems like everything in the pattern is effectively "optional" due to the '*' operator.

I'm a novice with Perl, but unless I made a mistake, it will match
unconditionally:

    print "It matches\n" if "aaabba" =~ /^(a*).*\1$/;
    It matches
    print "It matches\n" if "aaabbac" =~ /^(a*).*\1$/;
    It matches
    print "It matches\n" if "" =~ /^(a*).*\1$/;
    It matches
    print "It matches\n" if "x" =~ /^(a*).*\1$/;
    It matches
    print "It matches\n" if "xyz" =~ /^(a*).*\1$/;
    It matches
    print "It matches\n" if "123" =~ /^(a*).*\1$/;
    It matches

2. The other issue is that when I try this example with the string which is posted to not match, the Boost regex engine runs for a while and ultimately crashes with a memory error. (seems like it might be an endless loop of some sort). Is that a bug?

Nick
