Re: [Boost-bugs] [Boost C++ Libraries] #8392: Too complex regex for boost::regex (perl, extended) isn't so complex

Subject: Re: [Boost-bugs] [Boost C++ Libraries] #8392: Too complex regex for boost::regex (perl, extended) isn't so complex
From: Boost C++ Libraries (noreply_at_[hidden])
Date: 2013-04-21 11:53:37


#8392: Too complex regex for boost::regex (perl, extended) isn't so complex
-------------------------------+--------------------------------------------
  Reporter: bkalinczuk@… | Owner: johnmaddock
      Type: Bugs | Status: closed
 Milestone: To Be Determined | Component: regex
   Version: Boost 1.52.0 | Severity: Problem
Resolution: wontfix | Keywords:
-------------------------------+--------------------------------------------

Comment (by bkalinczuk@…):

 Replying to [comment:5 anonymous]:
> >I checked your example with grep and it had no problem with matching
 the largest matching substring. (There wasn't even any delay).
>
> Grep is a completely different beast: it's a DFA meaning the complexity
 of matching is O(N). Perl compatible engines are NFA's, and in the worst
 case the complexity of obtaining a match is NP-complete, so something like
 O(N!) for an N-character string.
>
> >I understand, that for the purpose of limiting the time taken for regex
 matching/searching a developer can set a BOOST_REGEX_MAX_STATE_COUNT as he
 wishes. I would expect though, that regex matching will take
 O(BOOST_REGEX_MAX_STATE_COUNT * exp(size_of_regex) ) (128n in the
 example), which is not the case.
>
> No it's much worst than that as mentioned above. NP-complete in the
 worst case (which strangely this nearly is).
>
> >I believe there is a bug with counting number of states in this example
 and the behaviour is not as it is described.
>
> It's not states in the machine that matter - it's how many are
 ''visited'' while trying to find the match. The maximum is set to the
 lowest of:
>
> * BOOST_REGEX_MAX_STATE_COUNT
> * N^2^ for an N character string.
> * A lower limit k, which is hard coded to 100000.
>
> You can change this behavour by editing {{{estimate_max_state_count}}}
 in perl_matcher_common.hpp.


 My estimation have been N * 2^number_of_DFA_states, so for your example it
 would be 56448 (even less then hardcoded), which could be calculated in no
 time.
 I would never expect the regex matching to be dependent on text size in
 any other way then linear.

 There is this article about comparing perl regexes to the awk and grep
 approach:
 http:\/\/ swtch.com \/ ~rsc \/ regexp \/ regexp1.html

 Clearly I expected a Thompson approach.

-- 
Ticket URL: <https://svn.boost.org/trac/boost/ticket/8392#comment:6>
Boost C++ Libraries <http://www.boost.org/>
Boost provides free peer-reviewed portable C++ source libraries.

This archive was generated by hypermail 2.1.7 : 2017-02-16 18:50:12 UTC