Boost logo

Boost Users :

Subject: [Boost-users] [Regex | Xpressive] Efficiently "grepping" large files
From: Thomas Luzat (thomas_at_[hidden])
Date: 2011-08-16 16:44:08


Hello,

I am looking into Boost.Regex and Boost.Xpressive for implementing
grep-like functionality on large files. More specifically I want to
check whether a file contains a match for an expression which does not
need that much context (a few KB). regex_search and
BidirectionalIterators for the whole file work, but there is a
complication:

I want to do other operations on the file in other threads in parallel
(such as calculating hashes and extracting other data). Given that
this is mostly I/O bound I would like to avoid rereading portions of
the file as much as possible. My idea was to implement a circular
buffer which may block the workers (and reading, if data is ever read
faster than analyzed).

The circular buffer needs to know when old data is no longer required,
though, that is regex_search no longer works on it or backtracks into
it. The best idea so far is to read blocks of data and do slightly
overlapped calls of regex_search on those blocks. This approach has
got at least two problems:

1) Matches and context are potentially limited to the size of the overlap.

2) Semantics of the match might be changed if I choose an arbitrary
cut for the overlap (e.g. ^ and $ might change because I cut the
beginning or end of a line)

I would not mind to specify a maximum context of a few KB to maybe 1
MB, but I do not want to change semantics. Is there a better approach?
I was thinking about supplying some intelligent iterators, but
Regex/Xpressive might leave copies around longer than I anticipate,
such that I still cannot decide where they are and how far they may
still backtrack.

If that's not really possible with the boost libraries I would be
interested in alternatives, too. I am already considering looking into
some grep implementations.

Thanks in advance!

Thomas Luzat


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net