Boost logo

Boost :

From: Eric Niebler (eric_at_[hidden])
Date: 2005-09-15 11:59:50


Christopher Kohlhoff wrote:
>
> From my reading of the documentation, if I get a partial match and I
> want to continue to try for a full match I must buffer the entire data
> from the beginning of the partial match. This means that in the
> partial_regex_grep example it cannot find a substring match that is
> greater then 4096 characters long, because that is all the data it will
> buffer.

For the example in the docs, that's correct.

>
> Furthermore, each time I want to retest the input against the
> expression it must process the whole input string again. This is not
> ideal from an efficiency point of view, since I could potentially
> receive input data one byte at a time.

True.

> What I want is a stateful regular expression-based decoder object
> state. I can feed it more input which will cause more state
> transitions, and it will tell me when it reaches a terminal state.

Understood. Perhaps what is called for is a "pull" iterator that buffers
chunks of data at a time, and when the buffer underflows, it fetches
another chunk of data. That way, libraries like xpressive and Spirit can
keep their iterator-based interface and not worry whether or not
"++begin" goes to disk for the next 4Kb, or reads from a socket, or
whatever.

Would that address your problem?

> I
> never have to buffer more input than the block just read because
> earlier input will have been fully consumed by the decoder.

That's not the case for a backtracking regex engine like Boost.Regex or
xpressive. These libraries require bidirectional iterators because they
may need to back out state transitions and decrement the iterator to try
a different alternative. You'll need to buffer everything read so far,
or else write it to a tmp file so you can get it back should you need it.

And the problem of returning a partial match and persisting the current
state of the state machine is a hard one. Some implementations maintain
their state on the program stack, so returning effectively wipes out all
that information. These implementations would need to somehow serialize
the state stored on the program stack, and then de-serialize it in order
to begin executing where it left off. Tricky stuff.

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk