Subject: Re: [Boost-bugs] [Boost C++ Libraries] #11776: Effective way to find all regex matches in large file
From: Boost C++ Libraries (noreply_at_[hidden])
Date: 2016-11-23 16:13:11
#11776: Effective way to find all regex matches in large file
-------------------------------+--------------------------
Reporter: der-storch-85@⦠| Owner: johnmaddock
Type: Feature Requests | Status: reopened
Milestone: To Be Determined | Component: regex
Version: Boost 1.59.0 | Severity: Optimization
Resolution: | Keywords:
-------------------------------+--------------------------
Comment (by Dr. Robert van Engelen <engelen@â¦>):
I'm working on several C++ projects that require searching and scanning
large text files. I fully agree with the OP that the `partial_match`
implementation is broken. This is very unfortunate, because
`partial_match` provides the only possible mechanism to search streaming
input text without buffering the entire text. To restrict the regex to
simple forms that do not include repetitions (*, +) is not a viable
workaround for these projects (such as the RE/flex
https://sourceforge.net/projects/re-flex project). There are use cases in
which we must take interactive input (i.e. buffering one char at a time)
or take large files in which the pattern searched may not fit in the
current buffer allocated, thus not producing the longest match, and worse
we don't know if the buffer must be enlarged to continue iterating to find
the longest match.
The correct algorithm should consider that '''as long as backtracking on a
repetition pattern in the regex is still possible given some partial input
text, Boost.Regex should flag the result as a partial match instead of a
full match.'''. With this change, matching "`abc.*123`" may require the
whole input, but that is OK!
I added the following note to my project documentation until this problem
is resolved:
{{{
Boost.Regex may fail to find the longest match when repetition patterns
such
as `.*` are used. Under certain conditions greedy repetitions may
behave as
lazy repetitions. For example, the Boost.Regex engine may return the
short
match `abc` when the regex `a.*c` is applied to `abc abc`, instead of
returning the full match `abc abc`. The problem is caused by the
limitations
of Boost.Regex `match_partial` matching algorithm. To work around this
limitation, we suggest to make the repetition pattern as specific as
possible
and not overlap with the pattern that follows the repetition. *The best
solution is to read the entire input, which the RE/flex scanner
generator
does if the input file length can be determined* (but this won't always
work,
e.g. with interactive TTY input). We consider this Boost.Regex partial
match
behavior a bug, not a restriction, because *as long as backtracking on a
repetition pattern is possible given some partial text, Boost.Regex
should
flag the result as a partial match instead of a full match.* Otherwise,
it
is impossible to reliably use Boost.Regex partial matching.
}}}
Looking forward to have a resolution on this issue. I will be happy to
help out.
-- Ticket URL: <https://svn.boost.org/trac/boost/ticket/11776#comment:3> Boost C++ Libraries <http://www.boost.org/> Boost provides free peer-reviewed portable C++ source libraries.
This archive was generated by hypermail 2.1.7 : 2017-02-16 18:50:20 UTC