Re: [Boost-bugs] [Boost C++ Libraries] #11776: Effective way to find all regex matches in large file

Subject: Re: [Boost-bugs] [Boost C++ Libraries] #11776: Effective way to find all regex matches in large file
From: Boost C++ Libraries (noreply_at_[hidden])
Date: 2016-11-23 16:13:11

#11776: Effective way to find all regex matches in large file
  Reporter: der-storch-85@… | Owner: johnmaddock
      Type: Feature Requests | Status: reopened
 Milestone: To Be Determined | Component: regex
   Version: Boost 1.59.0 | Severity: Optimization
Resolution: | Keywords:

Comment (by Dr. Robert van Engelen <engelen@…>):

 I'm working on several C++ projects that require searching and scanning
 large text files. I fully agree with the OP that the `partial_match`
 implementation is broken. This is very unfortunate, because
 `partial_match` provides the only possible mechanism to search streaming
 input text without buffering the entire text. To restrict the regex to
 simple forms that do not include repetitions (*, +) is not a viable
 workaround for these projects (such as the RE/flex project). There are use cases in
 which we must take interactive input (i.e. buffering one char at a time)
 or take large files in which the pattern searched may not fit in the
 current buffer allocated, thus not producing the longest match, and worse
 we don't know if the buffer must be enlarged to continue iterating to find
 the longest match.

 The correct algorithm should consider that '''as long as backtracking on a
 repetition pattern in the regex is still possible given some partial input
 text, Boost.Regex should flag the result as a partial match instead of a
 full match.'''. With this change, matching "`abc.*123`" may require the
 whole input, but that is OK!

 I added the following note to my project documentation until this problem
 is resolved:
   Boost.Regex may fail to find the longest match when repetition patterns
   as `.*` are used. Under certain conditions greedy repetitions may
 behave as
   lazy repetitions. For example, the Boost.Regex engine may return the
   match `abc` when the regex `a.*c` is applied to `abc abc`, instead of
   returning the full match `abc abc`. The problem is caused by the
   of Boost.Regex `match_partial` matching algorithm. To work around this
   limitation, we suggest to make the repetition pattern as specific as
   and not overlap with the pattern that follows the repetition. *The best
   solution is to read the entire input, which the RE/flex scanner
   does if the input file length can be determined* (but this won't always
   e.g. with interactive TTY input). We consider this Boost.Regex partial
   behavior a bug, not a restriction, because *as long as backtracking on a
   repetition pattern is possible given some partial text, Boost.Regex
   flag the result as a partial match instead of a full match.* Otherwise,
   is impossible to reliably use Boost.Regex partial matching.

 Looking forward to have a resolution on this issue. I will be happy to
 help out.

Ticket URL: <>
Boost C++ Libraries <>
Boost provides free peer-reviewed portable C++ source libraries.

This archive was generated by hypermail 2.1.7 : 2017-02-16 18:50:20 UTC