Boost logo

Boost Users :

From: John Maddock (dr_john_maddock_at_[hidden])
Date: 2003-09-29 05:34:45


Message forwarded from Dave Gomboc:

> We have a stream of text data (actually it is a chunked
> response of a webserver, but it might be a file or
> something...) The data size may be very large and we need to
> do some find/replace in this stream using regular
> expressions. It is not possible to load entire stream into
> the memory. I'm not even hope that Regex++ could work with
> such kind of streams, but maybe someone can suggest me a
> solution to this problem?

This is possible without much trouble _if_ the text strings you are
trying to match will fit into main memory, even if all the data won't.
Use a std::list as a buffer, so that deleting and inserting is
inexpensive, and you don't have to perform copies and update iterators
when reading in more input. Fill your (sliding window) buffer, then
regex it, using the partial match feature whenever you have not already
detected the end of the file. If a match occurs, any characters
occuring before the match can be processed (serialized to disk, or
whatever) and removed from the buffer. Repeat this looking for a match
until you reach the end of current input. You'll either have "no match"
or "partial match" at this point: remove non-matching characters (that
could be all of them) at this point, then re-fill your buffer with more
data and keep going. If you're just doing one substitution, that's
fine, but if more, there are a few things to keep in mind.

I have done over 60 distinct substitutions at a time on a stream of 0.1
gigabytes using this technique, and it works quite quickly. (I used a
buffer that held about 1 million characters (2^20), but I also tested
with buffers as small as 256 characters -- the strings I was matching
were not large). Just read in data, process all the substitutions in
order, write the data out and repeat. If you're doing multiple
substitutions, you (of course) can't flush any data from memory until
you've tried all the substitutions on the current region of interest.
Also, for each later substituion, the current region of interest shrinks
because you have to acknowledge partial matches from earlier
subsitutions. For example, say that there is no partial match at the
end of substituions 1 through 6 (there might have been full matches
where substitution was performed, that's okay.) But substituion 7 has a
partial match at the last five characters. Now, your region of interest
for substitution 8 is five characters smaller, because you don't want to
process those last five characters because substitution 7 might change
them once you've read in more input. The order of your substituions is
important if the outputs of some desired subsitutions overlap with the
inputs of others.

If you are not grokking my meaning, I can fish around for the source
code. Perhaps it's easier to explain in code than in words, my code for
the entire process is maybe 10 or 15 lines. An alternative is to try to
do some streambuf hacking, but the above method was simpler and adequate
for my purposes.

Dave

---------------------------------
Want to chat instantly with your online friends? Get the FREE Yahoo!Messenger



Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net