Boost:

From: Jonathan Graehl (jonathan_at_[hidden])
Date: 2004-09-09 18:55:31


I think we can agree to the following:

Openable and Closeable interfaces (nasty spelling: as with useable/usable,
both closeable and closable are widely used; the same probably applies to
localizeable/localizable, and probably localisable too in England)

Either all filters are nonblocking filters, or some filters are blocking
and can't be attached to nonblocking sources/sinks. (Is there an easy way
to make the boilerplate for the non-bufferable type simple enough that
copy/paste isn't needed? Obviously the preprocessor can always be used,
but it isn't self-documenting.)

Personally, I favor eliminating blocking filters; they aren't that much
simpler, IMO. I guess we're more or less talking about cooperative vs.
preemptive multitasking (assuming you're forced to use a thread for each
blocking source/sink if you want to use simple blocking filters, rather
than nonblocking/multiplexed I/O with nonblocking filters).
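To make the distinction concrete, here is a minimal sketch of what a nonblocking component looks like. All names here (`memory_source`, `to_upper_filter`) and the `std::optional` return convention are my own illustration, not Boost's actual interface: the point is only that a nonblocking filter must tolerate "nothing available yet" from downstream and propagate it upward rather than spin or block.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical nonblocking source: get() returns the next character,
// or std::nullopt when nothing more is available (the EAGAIN case,
// conflated here with EOF for brevity).
struct memory_source {
    std::string data;
    std::size_t pos = 0;
    std::optional<char> get() {
        if (pos >= data.size()) return std::nullopt;  // nothing to give
        return data[pos++];
    }
};

// A nonblocking filter simply passes "nothing yet" up to its caller
// instead of waiting for input to arrive.
struct to_upper_filter {
    template <class Source>
    std::optional<char> get(Source& src) {
        auto c = src.get();
        if (!c) return std::nullopt;  // propagate EAGAIN/EOF upward
        return static_cast<char>(
            std::toupper(static_cast<unsigned char>(*c)));
    }
};
```

The caller, in turn, must be prepared to retry later, which is exactly the obligation a blocking filter refuses to take on.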

>Suppose I have an input filter f which expects that each time it invokes get()
>on the component immediately downstream it will block until the next character
>is available or the end of the stream is reached. Suppose that g is immediately
>upstream from f and wants to call f.get(). Supposing that no input is going to
>be available for a long time, how can you turn that into a non-blocking call?
>

This is tricky. I believe I already described the method for a blocking
output filter, and the same strategy *almost* works for a blocking input
filter as well.

source . A . B

where A is a blocking filter and source is a nonblocking source (it can
return EAGAIN when you get/read from it). B is either the user's code or
another nonblocking filter.

source . inWrapper . A . outWrapper . B

inWrapper communicates secretly with outWrapper, bypassing A entirely
and just returning EAGAIN to B whenever inWrapper wouldn't have anything
to give A. But we can't predict ahead of time how much an input filter A
may need to consume in order to generate a single output character in
response to get(), so the method of having outWrapper bypass A when
inWrapper doesn't have enough for A actually can't work. We could still
create a thread somewhere and patch things up, but let's just say that's
too complicated for now. Besides, you can just use blocking components
and put the whole process that reads from source into its own thread;
threads+blocking is easier than asynchronous/cooperative multitasking
anyway, so such a simulation would be of dubious value.

This discussion has caused me to seriously question the need for a
simple InputFilter interface, or even a "pull" InputFilter at all. The
Symmetric Filter idea seems most appropriate to me (and sufficiently
general, if you allow arbitrary iterator types instead of just char *).
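For illustration, here is a toy rendering of the symmetric-filter shape (my own sketch, not a claim about Boost's actual SymmetricFilter concept, though Boost.Iostreams defines one along these lines): the filter is handed an input range and an output range, advances both pointers as far as it can, and returns whether it still has pending output.

```cpp
#include <cctype>

// Sketch of a symmetric filter: consume [src_begin, src_end) and
// produce into [dest_begin, dest_end), advancing both cursors.
// Return true if more output is pending internally; this trivial
// character-for-character transform never buffers, so it returns
// false.
struct upper_symmetric_filter {
    bool filter(const char*& src_begin, const char* src_end,
                char*& dest_begin, char* dest_end, bool /*flush*/) {
        while (src_begin != src_end && dest_begin != dest_end)
            *dest_begin++ = static_cast<char>(
                std::toupper(static_cast<unsigned char>(*src_begin++)));
        return false;  // no output held back
    }
};
```

Note how naturally this handles both "not enough input" and "not enough output room": the loop just stops at whichever range runs out, and the cursors record the progress.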

Since we now allow boost::read to return fewer characters than requested
(even 0) without meaning EOF/error, if you really only want to output one
character and then return, you can conform to the MultiCharacter (Buffered)
interface but ignore n, always returning at most one character.
Additionally, except for the simplest filters, you still have to deal
with read requests that are smaller than what you would prefer to output,
and due to the lack of continuations/coroutines in the language, you'll
have to reify your current state and resume the next time somebody tries
to read you.
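As a sketch of that state reification (all names hypothetical, with a `-1`-for-EOF convention assumed for the source and the return value): a filter that doubles every character can have the caller's buffer fill mid-pair, so the pending second copy must be stashed in a member and emitted on the next read().

```cpp
#include <cstddef>
#include <string>

// Toy source: get() yields the next byte, or -1 at end of input.
struct string_source {
    std::string data;
    std::size_t pos = 0;
    int get() {
        return pos < data.size()
            ? static_cast<unsigned char>(data[pos++]) : -1;
    }
};

// Emits each input character twice. 'pending' is the reified state:
// a duplicate that didn't fit in the previous read request.
struct doubling_filter {
    int pending = -1;
    template <class Source>
    std::ptrdiff_t read(Source& src, char* s, std::ptrdiff_t n) {
        std::ptrdiff_t written = 0;
        while (written < n) {
            if (pending >= 0) {             // resume interrupted pair
                s[written++] = static_cast<char>(pending);
                pending = -1;
                continue;
            }
            int c = src.get();
            if (c < 0) break;               // source exhausted
            s[written++] = static_cast<char>(c);
            pending = c;                    // second copy may not fit now
        }
        return written == 0 ? -1 : written; // -1 signals EOF
    }
};
```

Without coroutines, every filter whose output can outrun the caller's buffer ends up writing this kind of explicit stash-and-resume logic.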

I suppose that an InputFilter really does allow you to write your
parsing/recognition code in buffer-oblivious form (that is, you keep
reading until you get a whole block of compressed data, or until you
completely match your pattern), and you can reify your output state by
simply buffering a whole chunk of output internally. But what's the
problem with buffering input if you need to? You already have to
potentially buffer output.

There is a pretty well-known stream/message pipeline mechanism: linked
lists of buffer chunks allocated by the producer, passed by
reference/pointer to the consumer, and freed when totally consumed (you
could allow a transfer-ownership operation, which would be nice for
in-place transforms, but you don't really need one). Think message
passing/queuing, but ignoring message boundaries and interpreting the
data as a stream.

Fixed-size buffer chunks (that may be partially full) are usually called
"mbufs", and a list of them is an mbuf chain. The only complexity
involved in processing chains is that you have to either write your
filter so that it operates character by character (a convenience
iterator could step through the chain), or reassemble the mbuf fragments
into one contiguous buffer for reading. You can adopt more complex logic
that works directly on a sequence of mbufs (for instance, POSIX supports
writing/reading files/sockets to/from a sequence of noncontiguous
buffers, called "scatter/gather I/O"). You can imagine variations where
you also allow circular buffering rather than one-pass use of mbufs.
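A toy rendering of the mbuf-chain idea (everything here is illustrative, including the tiny 4-byte chunk size and the deliberately naive append that re-walks the chain; real mbufs are far more elaborate): fixed-size chunks linked by the producer, plus the "reassemble into one contiguous buffer" operation mentioned above.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// One fixed-size chunk, possibly partially full.
struct mbuf {
    static constexpr std::size_t cap = 4;  // tiny, to show chaining
    char data[cap];
    std::size_t len = 0;                   // bytes actually used
    std::unique_ptr<mbuf> next;
};

// Producer side: append bytes, growing the chain one chunk at a time.
// (Re-walking the chain per byte is O(n^2); a real producer would keep
// a tail pointer.)
void append(std::unique_ptr<mbuf>& head, const std::string& bytes) {
    for (char c : bytes) {
        mbuf* m = head.get();
        while (m && m->next) m = m->next.get();
        if (!m || m->len == mbuf::cap) {
            auto fresh = std::make_unique<mbuf>();
            if (m) { m->next = std::move(fresh); m = m->next.get(); }
            else   { head = std::move(fresh);    m = head.get(); }
        }
        m->data[m->len++] = c;
    }
}

// Consumer side: reassemble the fragments into one contiguous buffer.
std::string gather(const mbuf* m) {
    std::string out;
    for (; m; m = m->next.get()) out.append(m->data, m->len);
    return out;
}
```

A convenience iterator stepping chunk by chunk would let filters process the chain character by character without the copy that gather() performs.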

This reminds me: please consider the possibility of (some day in the
future) automagically running pipelined filters in their own threads
(properly synchronized, of course), and possibly some policy allowing
the granularity of work done to be increased (generally, larger buffers
would allow this) to minimize context-switching overhead (which exists
even for nonthreaded pipelines, because of cache coherency,
state/continuation reification, and, least importantly, function call
overhead). mbuf chains might be easier to synchronize efficiently in a
multithreaded environment than semaphores/circular buffers (wild
speculation).

>I guess you can
>> repeatedly open() and close() fstreams that way, although I've never
>> wanted to.
>
>I want to provide the same functionality that the standard library provides.
>People would complain if you couldn't close and reopen an fstream, don't you
>think?
>

Suffice it to say that I was heretofore unaware such capability ever
existed, and you're obviously right to want it for your library :)

> struct my_filter : input_filter {
> template<typename Source>
> std::streamsize read(Source& src, char* s, std::streamsize n);
> };
>
>Here the interpretation of the return value is clear: return the number of
>characters read, or -1 for EOF. The trouble is, there are some simple filters
>that it's very hard to express with the above interface.
>
As I mentioned earlier, this wouldn't add any difficulties at all. You
can simply return at most 1 character, no matter how many are requested,
if that really simplifies your implementation. Naturally, there's some
runtime overhead that wouldn't be necessary with your separate
one-character interface, but doing things that way is generally less
efficient anyway; you've already opted for programmer time over machine
time.
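A sketch of that "ignore n" conformance (hypothetical names; the `-1`-for-EOF convention for both the source and the return value is assumed from the discussion above, not taken from the library):

```cpp
#include <cstddef>
#include <string>

// Toy source: get() yields the next byte, or -1 at end of input.
struct string_source {
    std::string data;
    std::size_t pos = 0;
    int get() {
        return pos < data.size()
            ? static_cast<unsigned char>(data[pos++]) : -1;
    }
};

// A filter with the multi-character read() signature that legally
// ignores n and hands back a single character per call.
struct one_at_a_time_filter {
    template <class Source>
    std::ptrdiff_t read(Source& src, char* s, std::ptrdiff_t n) {
        if (n <= 0) return 0;          // nothing requested
        int c = src.get();
        if (c < 0) return -1;          // EOF
        s[0] = static_cast<char>(c);
        return 1;                      // at most one, regardless of n
    }
};
```

The per-call overhead is the cost of this shortcut, but as noted above, a filter simple enough to want it has already traded machine time for programmer time.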

Sorry this has gotten to be so involved - I am perfectly happy to just
use whatever interfaces you have now - it's better than what I have
(i.e. nothing). The binary I/O you propose (in "future directions")
sounds cool also - especially if the library had both native and
"network byte order" variants. Obviously anything involving passing
mbuf chains instead of internal buffering would be a huge change to
interface and implementation; I don't think it needs to happen, or if it
does, there would be a place for it as a second library that also
handles message-passing pipelines (where message boundaries are
preserved and not just implementation details of streams).

Yrs,
Jonathan


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk