Boost logo

Boost Users :

Subject: Re: [Boost-users] [Regex | Xpressive] Efficiently "grepping" large files
From: John Maddock (boost.regex_at_[hidden])
Date: 2011-08-17 03:51:42


> I am looking into Boost.Regex and Boost.Xpressive for implementing
> grep-like functionality on large files. More specifically I want to check
> whether a file contains a match for an expression which does not need that
> much context (a few KB). regex_search and BidirectionalIterators for the
> whole file work, but there is a complication:
>
> I want to do other operations on the file in other threads in parallel
> (such as calculating hashes and extracting other data). Given that this is
> mostly I/O bound I would like to avoid rereading portions of the file as
> much as possible. My idea was to implement a circular buffer which may
> block the workers (and reading, if data is ever read faster than
> analyzed).
>
> The circular buffer needs to know when old data is no longer required,
> though, that is regex_search no longer works on it or backtracks into it.
> The best idea so far is to read blocks of data and do slightly overlapped
> calls of regex_search on those blocks. This approach has got at least two
> problems:
>
> 1) Matches and context are potentially limited to the size of the overlap.
>
> 2) Semantics of the match might be changed if I choose an arbitrary cut
> for the overlap (e.g. ^ and $ might change because I cut the beginning or
> end of a line)
>
> I would not mind to specify a maximum context of a few KB to maybe 1 MB,
> but I do not want to change semantics. Is there a better approach? I was
> thinking about supplying some intelligent iterators, but Regex/Xpressive
> might leave copies around longer than I anticipate, such that I still
> cannot decide where they are and how far they may still backtrack.
>
> If that's not really possible with the boost libraries I would be
> interested in alternatives, too. I am already considering looking into
> some grep implementations.

Have you looked at the "partial match" feature of Boost.Regex:
http://www.boost.org/doc/libs/1_47_0/libs/regex/doc/html/boost_regex/partial_matches.html

HTH, John.


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net