There has been much discussion in the last couple of years
concerning the STL, and the abstraction of a sequence (materialized
by its concept of iterator) that it so elegantly uses. Strangely
enough, however, another major abstraction in the proposed standard
library doesn't seem to get much attention: that of an abstract data
sink and/or source, as materialized in Jerry Schwarz's streambuf.
This may be partially due to the fact that most people only use the
formatting interface, and somehow only associate the streambuf with
IO. In fact, IO is only a special case of the more general
abstraction of a data sink or source.
In this article, I will concentrate on one particular variation of
this abstraction, which I call a filtering streambuf. In a
filtering streambuf, the streambuf in question is not the ultimate
sink or source, but simply processes the data and passes it on. In
this way, it acts somewhat like a pipe in UNIX,
but within the process. Anyone who has worked with UNIX is familiar
with just how powerful this piping idiom can be.
Many different filtering streambufs are possible: on input, we can filter out comments, or merge continuation lines; on output, we can expand tabs, or insert timestamps. Ideally, we would like to write as little code as possible, which suggests some sort of generic class to handle the boilerplate, with only the actual filtering to be written for each use.
One small note: the iostream library has evolved some in the
standards committee, and different compilers will have different
versions. For convenience, I've based the following code on the
original release which accompanied CFront. This is partially
because this corresponds to the version I use most often, and
partially because it is more or less a lowest common denominator;
implementations supporting the more modern features will generally
also support the older ones. I've also ignored namespaces, largely
for the same reasons. You may find that you need to adapt my code
somewhat to make it work with your compiler. (All of the code has
been tested with Sun CC 4.1, g++ 2.7.2 and Microsoft Visual C++ 5.0;
with g++ and the Microsoft compiler, the old iostream library,
<iostream.h>, has been used.)
In this article, I will also try to explain more of the general principles of writing a streambuf. Judging from questions on the network, this is a little-known area of C++, so I cannot assume knowledge of even the basic principles.
To begin with, the abstraction behind the streambuf is that of a
data sink and source. Istream and ostream take care of the
formatting, and use a streambuf as the source or sink of individual
characters. The class derived from streambuf takes care of any
buffering, and of getting the characters to or from their final
source or destination. Buffering within the streambuf
is done in what are known as the get and the put areas; these will
be explained as necessary in the implementation.
An istream or ostream has a number of functions with which to
interface with the streambuf. For the most part, these functions
work on the get or the put area (the buffers). If there is no room
in the put area to output a character, the virtual function overflow
is called. If there are no characters in the get area when a read
is attempted, the virtual function underflow is called. The key to
writing a streambuf is in overriding these two functions. It is
also generally necessary to override sync, and sometimes setbuf, and
of course, if we wish to support random access, we also have to
override seekpos and seekoff. (We will not consider random access
here. Since the default behavior of these two functions is to
return an error, we can ignore them.)
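To make the mechanism concrete, here is a minimal sketch of a streambuf that overrides nothing but overflow. It assumes a compiler with the standard <streambuf> header and namespace std, rather than the older <iostream.h> used in the rest of this article; CountingStreambuf and countFormatted are hypothetical names chosen for the illustration. Because the class never defines a put area, every character written through the ostream should end up in overflow.

```cpp
#include <cassert>
#include <ostream>
#include <streambuf>

// Hypothetical example class: a sink that discards its input but counts
// the characters it receives.  It never calls setp(), so there is no put
// area, and each character is handed to overflow() individually.
class CountingStreambuf : public std::streambuf
{
public:
    CountingStreambuf() : myCount( 0 ) {}
    long count() const { return myCount ; }

protected:
    virtual int_type overflow( int_type ch )
    {
        if ( ch != traits_type::eof() )
            ++ myCount ;
        // Any value other than EOF reports success to the caller.
        return traits_type::not_eof( ch ) ;
    }

private:
    long myCount ;
} ;

// The ostream does the formatting; the streambuf only sees characters.
long countFormatted()
{
    CountingStreambuf sink ;
    std::ostream out( &sink ) ;
    out << "abc" << 42 ;
    assert( out.good() ) ;
    return sink.count() ;
}
```

The formatting of the integer happens entirely in the ostream; the streambuf is only ever asked to accept individual characters.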
If the streambuf is to handle input and output simultaneously, we will also have to think about synchronization issues. In practice, filtering streambufs are almost always unidirectional, and we will only consider unidirectional buffers in this article.
For various reasons, output is slightly simpler than input, so we
will start with it. The template class will be called
FilteringOutputStreambuf; it will be instantiated over a
function-like object that can be called with a reference to a
streambuf and a character to be output, and will return an int:
either the next character in the sequence, or EOF if none is
available.
First, the class definition. I'll describe the details as they are used in the functions later:
    template< class Inserter >
    class FilteringOutputStreambuf : public streambuf
    {
    public:
        FilteringOutputStreambuf( streambuf* dest ,
                                  Inserter i ,
                                  bool deleteWhenFinished = false ) ;
        FilteringOutputStreambuf( streambuf* dest ,
                                  bool deleteWhenFinished = false ) ;
        virtual ~FilteringOutputStreambuf() ;

        virtual int overflow( int ch ) ;
        virtual int underflow() ;
        virtual int sync() ;
        virtual streambuf* setbuf( char* p , int len ) ;

        inline Inserter& inserter() ;

    private:
        streambuf* myDest ;
        Inserter   myInserter ;
        bool       myDeleteWhenFinished ;
    } ;
I'll begin with the easiest functions. Although the recently
adopted draft standard defines the default implementation of
underflow to fail, this behavior was undefined in earlier
implementations, so it is generally a good idea to override this
function to simply return EOF -- any attempt to read from our output
filtering streambuf should result in failure.
We generally want to be able to look at every character, so any
buffering is out. To avoid buffering in the output streambuf, it is
sufficient to never define a buffer. As for the function setbuf, I
generally just pass it directly on to the final streambuf; this is
the class which should be doing the actual buffering anyway. (Of
course, since myDest is a pointer, I check for NULL before doing
so.)
The function sync is also surprisingly simple. Since we don't have
any buffering, there is nothing to synchronize, so we just pass this
one on to the final destination as well.
The function inserter is just for convenience -- I've never actually
found a use for it on output, but its equivalent on input is
sometimes useful, and I like to keep things orthogonal. It just
returns a reference to the myInserter member.
Which leaves overflow:
    template< class Inserter >
    int
    FilteringOutputStreambuf< Inserter >::overflow( int ch )
    {
        int result( EOF ) ;
        if ( ch == EOF )
            result = sync() ;
        else if ( myDest != NULL ) {
            assert( ch >= 0 && ch <= UCHAR_MAX ) ;
            result = myInserter( *myDest , ch ) ;
        }
        return result ;
    }
Although it wasn't ever specified, earlier implementations of
iostream would flush the buffer if overflow was called with EOF, and
some applications may count on this, so we want to do likewise.
Other than that, we ensure that myDest isn't NULL before calling
myInserter, and that's it.
Finally, there are the constructors and the destructor. In fact,
there isn't much to say about them either; they simply initialize
the obvious member variables. The only particularity is the Boolean
flag to transfer ownership of the targeted streambuf: a convenience
feature for the user, which also simplifies exception safety. In
the destructor, I generally call sync, but it isn't necessary. And
of course, if the user asked for it, I delete the final destination
streambuf.
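Putting the pieces described so far together, the whole output class might look something like the following sketch. It is an adaptation rather than the original code: it assumes the standard <streambuf> header and namespace std, folds the two constructors into one with a default Inserter argument, and uses a hypothetical UpperCaseInserter and helper function upperCased purely to exercise it.

```cpp
#include <cassert>
#include <cctype>
#include <climits>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

template< class Inserter >
class FilteringOutputStreambuf : public std::streambuf
{
public:
    FilteringOutputStreambuf( std::streambuf* dest ,
                              Inserter i = Inserter() ,
                              bool deleteWhenFinished = false )
        : myDest( dest ) , myInserter( i )
        , myDeleteWhenFinished( deleteWhenFinished ) {}
    virtual ~FilteringOutputStreambuf()
    {
        sync() ;                        // not strictly necessary
        if ( myDeleteWhenFinished )
            delete myDest ;
    }
    Inserter& inserter() { return myInserter ; }

protected:
    virtual int_type overflow( int_type ch )
    {
        int_type result( traits_type::eof() ) ;
        if ( ch == traits_type::eof() )
            result = sync() ;           // historical "flush" behavior
        else if ( myDest != NULL ) {
            assert( ch >= 0 && ch <= UCHAR_MAX ) ;
            result = myInserter( *myDest , ch ) ;
        }
        return result ;
    }
    // Reading from an output filter always fails.
    virtual int_type underflow() { return traits_type::eof() ; }
    // No local buffer, so both of these are simply forwarded.
    virtual int sync()
    {
        return myDest != NULL ? myDest->pubsync() : 0 ;
    }
    virtual std::streambuf* setbuf( char* p , std::streamsize len )
    {
        if ( myDest == NULL )
            return NULL ;
        return myDest->pubsetbuf( p , len ) ;
    }

private:
    std::streambuf* myDest ;
    Inserter        myInserter ;
    bool            myDeleteWhenFinished ;
} ;

// Hypothetical inserter, for illustration only: forces upper case.
struct UpperCaseInserter
{
    int operator()( std::streambuf& dst , int ch )
    {
        return dst.sputc( static_cast< char >( std::toupper( ch ) ) ) ;
    }
} ;

// Writes text through the filter into a stringbuf and returns the result.
std::string upperCased( std::string const& text )
{
    std::stringbuf dest ;
    FilteringOutputStreambuf< UpperCaseInserter > filter( &dest ) ;
    std::ostream out( &filter ) ;
    out << text ;
    return dest.str() ;
}
```

Note that the destination is declared before the filter in upperCased, so that it is still alive when the filter's destructor calls sync.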
A simple use of this class might be to systematically insert a time
stamp at the start of every line. (The operator() function is
simpler than it looks; the only real work is in getting and
formatting the time.)
    class TimeStampInserter
    {
    public:
        TimeStampInserter()
            : myAtStartOfLine( true )
        {
        }

        int operator()( streambuf& dst , int ch )
        {
            bool errorSeen( false ) ;
            if ( myAtStartOfLine && ch != '\n' ) {
                time_t t( time( NULL ) ) ;
                tm* time = localtime( &t ) ;
                char buffer[ 128 ] ;
                int length(
                    strftime( buffer , sizeof( buffer ) , "%c: " , time ) ) ;
                assert( length > 0 ) ;
                if ( dst.sputn( buffer , length ) != length )
                    errorSeen = true ;
            }
            myAtStartOfLine = (ch == '\n') ;
            return errorSeen ? EOF : dst.sputc( ch ) ;
        }

    private:
        bool myAtStartOfLine ;
    } ;
(In case you're wondering: I don't normally write functions in the class definition like this, but it seems easier for exposition to put everything in one place.)
In this case, there really isn't any reason to ever pass an Inserter
argument to the FilteringOutputStreambuf constructor, since all
newly constructed instances of the class are identical. If the
class didn't require state, you could write it as a function, and
pass a pointer to the function as argument to the constructor of a
FilteringOutputStreambuf< int (*)( streambuf& , int ) >. (I'd
seriously consider using a typedef in this case. The above not only
confuses human readers; it has confused more than one compiler I've
tried it with as well.)
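As a sketch of the function pointer variant with a typedef: the code below assumes the standard <streambuf> header, and replaces the full class by a reduced stand-in with the same name (only overflow, no sync, setbuf or ownership flag) so that the example stays short. The names InserterFunction, crlfInserter and withCrLf are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Reduced stand-in for the article's class: just enough to show the
// instantiation over a pointer to function.
template< class Inserter >
class FilteringOutputStreambuf : public std::streambuf
{
public:
    FilteringOutputStreambuf( std::streambuf* dest , Inserter i )
        : myDest( dest ) , myInserter( i ) {}

protected:
    virtual int_type overflow( int_type ch )
    {
        if ( ch == traits_type::eof() )
            return traits_type::not_eof( ch ) ;     // nothing to write
        if ( myDest == NULL )
            return traits_type::eof() ;
        return myInserter( *myDest , ch ) ;
    }

private:
    std::streambuf* myDest ;
    Inserter        myInserter ;
} ;

// The typedef keeps the template argument readable.
typedef int (*InserterFunction)( std::streambuf& , int ) ;

// A stateless filter written as a plain function: it outputs a carriage
// return before every newline.
int crlfInserter( std::streambuf& dst , int ch )
{
    int const eof( std::streambuf::traits_type::eof() ) ;
    if ( ch == '\n' && dst.sputc( '\r' ) == eof )
        return eof ;
    return dst.sputc( static_cast< char >( ch ) ) ;
}

std::string withCrLf( std::string const& text )
{
    std::stringbuf dest ;
    FilteringOutputStreambuf< InserterFunction >
        filter( &dest , &crlfInserter ) ;
    std::ostream out( &filter ) ;
    out << text ;
    return dest.str() ;
}
```

With a pointer to function there is no sensible default value, which is why this variant always needs the constructor taking an Inserter argument.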
Now we'll do the same thing for input; we'll call it
FilteringInputStreambuf. This class is slightly more complicated
than the output one, because of the interface definition of
underflow: underflow does not extract a character from the input
stream, it simply ensures that there is a character in the buffer.
Which in turn means that we cannot ignore the issue of buffering
completely. Anyway, here's the class definition; the instantiation
type must be callable with a reference to a streambuf, and return an
int (either the character read or EOF):
    template< class Extractor >
    class FilteringInputStreambuf : public streambuf
    {
    public:
        FilteringInputStreambuf( streambuf* source ,
                                 Extractor x ,
                                 bool deleteWhenFinished = false ) ;
        FilteringInputStreambuf( streambuf* source ,
                                 bool deleteWhenFinished = false ) ;
        virtual ~FilteringInputStreambuf() ;

        virtual int overflow( int ) ;
        virtual int underflow() ;
        virtual int sync() ;
        virtual streambuf* setbuf( char* p , int len ) ;

        inline Extractor& extractor() ;

    private:
        streambuf* mySource ;
        Extractor  myExtractor ;
        char       myBuffer ;
        bool       myDeleteWhenFinished ;
    } ;
As with output, we'll do the easy parts first: overflow is just an
error (return EOF), and setbuf is forwarded. We could argue that
sync should be either an error or forwarded, since it isn't supposed
to do anything on an input stream. In fact, in our case,
synchronization does have a meaning, since any characters in our
local buffer have been extracted from the real input streambuf, but
have not been read.
The function extractor just returns a reference to the corresponding
data member. Unlike the output side, this has a definite use: some
of the filters may remove newline characters; in such cases, the
extractor should maintain the correct line number from the source,
and the user should access the extractor to obtain it for e.g. error
messages.
Which brings us to the question of buffering: we need a buffer of at
least one character in order to correctly support the semantics of
underflow, which in turn are defined that way in order to support
non-extracting look-ahead, for parsing things like numbers, where
you cannot know when you have finished before having seen a
character you don't want. To keep things simple, we maintain a
one-character buffer directly in the class: myBuffer.
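The look-ahead requirement can be illustrated independently of the filtering classes. The sketch below assumes the standard <streambuf> interface; readDigits is a hypothetical helper. It uses sgetc (which calls underflow when the get area is empty, but extracts nothing) to decide whether to consume a character with sbumpc.

```cpp
#include <cassert>
#include <sstream>
#include <streambuf>
#include <string>

// Collects a run of decimal digits from src.  The character that ends
// the run is looked at, but left in the streambuf for the next reader.
std::string readDigits( std::streambuf& src )
{
    std::string digits ;
    int ch( src.sgetc() ) ;             // look, but do not extract
    while ( ch >= '0' && ch <= '9' ) {
        digits += static_cast< char >( src.sbumpc() ) ;    // now extract
        ch = src.sgetc() ;
    }
    return digits ;
}
```

A streambuf whose underflow extracted the character instead of just making it available would lose the terminating character here, which is why even an otherwise unbuffered input filter needs its one-character buffer.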
Which gives us enough information to write underflow:
    template< class Extractor >
    int
    FilteringInputStreambuf< Extractor >::underflow()
    {
        int result( EOF ) ;
        if ( gptr() < egptr() )
            result = *gptr() ;
        else if ( mySource != NULL ) {
            result = myExtractor( *mySource ) ;
            if ( result != EOF ) {
                assert( result >= 0 && result <= UCHAR_MAX ) ;
                myBuffer = result ;
                setg( &myBuffer , &myBuffer , &myBuffer + 1 ) ;
            }
        }
        return result ;
    }
Several points are worth mentioning:
We use setg (a member function of streambuf) to set the pointers
into the buffer.
I generally define sync to resynchronize with the actual source:
    template< class Extractor >
    int
    FilteringInputStreambuf< Extractor >::sync()
    {
        int result( 0 ) ;
        if ( mySource != NULL ) {
            if ( gptr() < egptr() ) {
                result = mySource->sputbackc( *gptr() ) ;
                setg( NULL , NULL , NULL ) ;
            }
            if ( mySource->sync() == EOF )
                result = EOF ;
        }
        return result ;
    }
If we have a character in our buffer, we send it back, and clear the buffer. And I sync with the original source -- it may be a FilteringInputStreambuf as well.
The constructors are just simple initialization; the destructor adds
a call to sync, and deletion of the source if requested.
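For reference, here is a sketch of the complete input class adapted to the standard <streambuf> header (an assumption; the original targets <iostream.h>), with the two constructors folded into one and sync reduced to returning 0 or -1. ContinuationExtractor and firstLogicalLine are hypothetical names: the extractor merges lines ending in a backslash and keeps track of the source line number, which the caller can read through extractor().

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

template< class Extractor >
class FilteringInputStreambuf : public std::streambuf
{
public:
    FilteringInputStreambuf( std::streambuf* source ,
                             Extractor x = Extractor() ,
                             bool deleteWhenFinished = false )
        : mySource( source ) , myExtractor( x ) , myBuffer( 0 )
        , myDeleteWhenFinished( deleteWhenFinished ) {}
    virtual ~FilteringInputStreambuf()
    {
        sync() ;                        // give back any unread character
        if ( myDeleteWhenFinished )
            delete mySource ;
    }
    Extractor& extractor() { return myExtractor ; }

protected:
    virtual int_type overflow( int_type ) { return traits_type::eof() ; }
    virtual int_type underflow()
    {
        int_type result( traits_type::eof() ) ;
        if ( gptr() < egptr() )
            result = traits_type::to_int_type( *gptr() ) ;
        else if ( mySource != NULL ) {
            result = myExtractor( *mySource ) ;
            if ( result != traits_type::eof() ) {
                assert( result >= 0 && result <= UCHAR_MAX ) ;
                myBuffer = static_cast< char >( result ) ;
                setg( &myBuffer , &myBuffer , &myBuffer + 1 ) ;
            }
        }
        return result ;
    }
    virtual int sync()
    {
        int result( 0 ) ;
        if ( mySource != NULL ) {
            if ( gptr() < egptr() ) {
                if ( mySource->sputbackc( *gptr() ) == traits_type::eof() )
                    result = -1 ;
                setg( NULL , NULL , NULL ) ;
            }
            if ( mySource->pubsync() == -1 )
                result = -1 ;
        }
        return result ;
    }

private:
    std::streambuf* mySource ;
    Extractor       myExtractor ;
    char            myBuffer ;
    bool            myDeleteWhenFinished ;
} ;

// Hypothetical extractor: joins lines ending in a backslash, and counts
// the newlines consumed from the source.
class ContinuationExtractor
{
public:
    ContinuationExtractor() : myLine( 1 ) {}
    int operator()( std::streambuf& src )
    {
        int ch( src.sbumpc() ) ;
        while ( ch == '\\' && src.sgetc() == '\n' ) {
            src.sbumpc() ;              // drop the newline
            ++ myLine ;
            ch = src.sbumpc() ;
        }
        if ( ch == '\n' )
            ++ myLine ;
        return ch ;
    }
    int line() const { return myLine ; }

private:
    int myLine ;
} ;

// Reads one logical line; returns the source line number reached.
int firstLogicalLine( std::string const& text , std::string& line )
{
    std::stringbuf source( text ) ;
    FilteringInputStreambuf< ContinuationExtractor > filter( &source ) ;
    std::istream in( &filter ) ;
    std::getline( in , line ) ;
    return filter.extractor().line() ;
}
```

The setbuf forwarding is omitted here only to keep the listing short; it would follow the same pattern as on the output side.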
I tend to use this class much more often than the output one. One simple example is stripping end-of-line comments:
    class UncommentExtractor
    {
    public:
        UncommentExtractor( char commentChar = '#' )
            : myCommentChar( commentChar )
        {
        }

        int operator()( streambuf& src )
        {
            int ch( src.sbumpc() ) ;
            if ( ch == myCommentChar ) {
                while ( ch != EOF && ch != '\n' )
                    ch = src.sbumpc() ;
            }
            return ch ;
        }

    private:
        char myCommentChar ;
    } ;
With the above, we can already do everything necessary. Still, it
is often extra work to have to declare the streambuf and the istream
or ostream separately. So it is convenient to also define the
corresponding template classes for istream and ostream. Here's the
class definition for FilteringIstream; FilteringOstream follows the
same pattern:
    template< class Extractor >
    class FilteringIstream
        : private FilteringInputStreambuf< Extractor >
        , public istream
    {
    public:
        FilteringIstream( istream& source ,
                          Extractor x ) ;
        FilteringIstream( istream& source ) ;
        FilteringIstream( streambuf* source ,
                          Extractor x ,
                          bool deleteWhenFinished = false ) ;
        FilteringIstream( streambuf* source ,
                          bool deleteWhenFinished = false ) ;
        virtual ~FilteringIstream() ;

        FilteringInputStreambuf< Extractor >* rdbuf() ;
    } ;
The somewhat unusual inheritance is a trick I learned from Dietmar
Kühl; it serves to ensure that the streambuf is fully initialized
before its address is passed to the constructor of istream, without
having to allocate it dynamically on the heap. It's also worth
noting the constructors taking an istream& instead of a streambuf*;
again, just a convenience, but it means that you can pass cin
directly as an argument, rather than having to use cin.rdbuf().
(The call to rdbuf is still there, of course, in the initialization
list of the constructor.) And if another istream is using the
streambuf, you certainly don't want to delete it, so we drop that
parameter completely.
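The following sketch shows how the constructor taking an istream& might be written with this inheritance trick. It assumes the standard <istream> and <streambuf> headers, uses a reduced stand-in for FilteringInputStreambuf (underflow only), and shows only one of the four constructors; firstUncommentedLine is a hypothetical helper reading from an istringstream instead of cin.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Reduced stand-in for the article's class: underflow only.
template< class Extractor >
class FilteringInputStreambuf : public std::streambuf
{
public:
    FilteringInputStreambuf( std::streambuf* source , Extractor x )
        : mySource( source ) , myExtractor( x ) , myBuffer( 0 ) {}

protected:
    virtual int_type underflow()
    {
        if ( gptr() < egptr() )
            return traits_type::to_int_type( *gptr() ) ;
        int_type result( mySource != NULL
                         ? myExtractor( *mySource )
                         : traits_type::eof() ) ;
        if ( result != traits_type::eof() ) {
            myBuffer = static_cast< char >( result ) ;
            setg( &myBuffer , &myBuffer , &myBuffer + 1 ) ;
        }
        return result ;
    }

private:
    std::streambuf* mySource ;
    Extractor       myExtractor ;
    char            myBuffer ;
} ;

// The extractor from the article, adapted to namespace std.
class UncommentExtractor
{
public:
    UncommentExtractor( char commentChar = '#' )
        : myCommentChar( commentChar ) {}
    int operator()( std::streambuf& src )
    {
        int const eof( std::streambuf::traits_type::eof() ) ;
        int ch( src.sbumpc() ) ;
        if ( ch == myCommentChar ) {
            while ( ch != eof && ch != '\n' )
                ch = src.sbumpc() ;
        }
        return ch ;
    }

private:
    char myCommentChar ;
} ;

// The private base is listed first, so it is constructed before its
// address is handed to the istream base.
template< class Extractor >
class FilteringIstream
    : private FilteringInputStreambuf< Extractor >
    , public std::istream
{
public:
    FilteringIstream( std::istream& source , Extractor x = Extractor() )
        : FilteringInputStreambuf< Extractor >( source.rdbuf() , x )
        , std::istream(
              static_cast< FilteringInputStreambuf< Extractor >* >( this ) )
    {
    }
    FilteringInputStreambuf< Extractor >* rdbuf() { return this ; }
} ;

std::string firstUncommentedLine( std::string const& text )
{
    std::istringstream source( text ) ;
    FilteringIstream< UncommentExtractor > input( source ) ;
    std::string line ;
    std::getline( input , line ) ;
    return line ;
}
```

One consequence of inheriting from both classes is that names present in both bases (getloc, for example) become ambiguous when used directly on the derived object; passing the object as an istream& avoids the problem.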
With all this, if you want to read standard input, ignoring
end-of-line comments, all you need is the UncommentExtractor shown
above, and the following definition:
FilteringIstream< UncommentExtractor > input( cin ) ;
That's all there is to it.
As you have seen, creating your own streambufs can be a powerful idiom. And we've only scratched the surface of the possibilities. The complete code for all of the classes discussed in this article, along with a number of additional inserters and extractors, can be downloaded from this site, so you can try it yourself.
It would be unfair if I tried to take credit for the entire concept. First and foremost, if Jerry Schwarz hadn't come up with the original idea of separating the sinking and sourcing of the data from the formatting, none of this would have been possible. And most of what I know about iostream, I learned from contributors in the C++ newsgroups, particularly Steve Clamage, who has always taken the time to answer most of the serious questions posed there. More recently, people like Dietmar Kühl have been pursuing similar paths of research.
Finally, I owe particular thanks to the customer site at which I first applied this technique, the LTS division of Alcatel SEL, in Stuttgart, and to my boss there, Ömer Oskay. The freedom they gave me to pursue new ways of doing things was amazing, and while I like to think that it always paid off for them in the end, it certainly wasn't always obvious beforehand that it would. Without their confidence in me, most of this work would not have been possible.