Subject: Re: [Boost-bugs] [Boost C++ Libraries] #8392: Too complex regex for boost::regex (perl, extended) isn't so complex
From: Boost C++ Libraries (noreply_at_[hidden])
Date: 2013-04-21 11:53:37
#8392: Too complex regex for boost::regex (perl, extended) isn't so complex
-------------------------------+--------------------------------------------
Reporter: bkalinczuk@⦠| Owner: johnmaddock
Type: Bugs | Status: closed
Milestone: To Be Determined | Component: regex
Version: Boost 1.52.0 | Severity: Problem
Resolution: wontfix | Keywords:
-------------------------------+--------------------------------------------
Comment (by bkalinczuk@â¦):
Replying to [comment:5 anonymous]:
> >I checked your example with grep and it had no problem with matching
the largest matching substring. (There wasn't even any delay).
>
> Grep is a completely different beast: it's a DFA meaning the complexity
of matching is O(N). Perl compatible engines are NFA's, and in the worst
case the complexity of obtaining a match is NP-complete, so something like
O(N!) for an N-character string.
>
> >I understand, that for the purpose of limiting the time taken for regex
matching/searching a developer can set a BOOST_REGEX_MAX_STATE_COUNT as he
wishes. I would expect though, that regex matching will take
O(BOOST_REGEX_MAX_STATE_COUNT * exp(size_of_regex) ) (128n in the
example), which is not the case.
>
> No it's much worst than that as mentioned above. NP-complete in the
worst case (which strangely this nearly is).
>
> >I believe there is a bug with counting number of states in this example
and the behaviour is not as it is described.
>
> It's not states in the machine that matter - it's how many are
''visited'' while trying to find the match. The maximum is set to the
lowest of:
>
> * BOOST_REGEX_MAX_STATE_COUNT
> * N^2^ for an N character string.
> * A lower limit k, which is hard coded to 100000.
>
> You can change this behavour by editing {{{estimate_max_state_count}}}
in perl_matcher_common.hpp.
My estimation have been N * 2^number_of_DFA_states, so for your example it
would be 56448 (even less then hardcoded), which could be calculated in no
time.
I would never expect the regex matching to be dependent on text size in
any other way then linear.
There is this article about comparing perl regexes to the awk and grep
approach:
http:\/\/ swtch.com \/ ~rsc \/ regexp \/ regexp1.html
Clearly I expected a Thompson approach.
-- Ticket URL: <https://svn.boost.org/trac/boost/ticket/8392#comment:6> Boost C++ Libraries <http://www.boost.org/> Boost provides free peer-reviewed portable C++ source libraries.
This archive was generated by hypermail 2.1.7 : 2017-02-16 18:50:12 UTC