Filtering Streambufs

Variations on a Theme by Schwarz

by James Kanze

Introduction

There has been much discussion in the last couple of years concerning the STL, and the abstraction of a sequence (materialized by its concept of iterator) that it so elegantly uses. Strangely enough, however, another major abstraction in the proposed standard library doesn't seem to get much attention, that of an abstract data sink and/or source, as materialized in Jerry Schwarz's streambuf. This may be partially due to the fact that most people only use the formatting interface, and somehow only associate the streambuf with IO. In fact, IO is only a special case of the more general abstraction of a data sink or source.

In this article, I will concentrate on one particular variation of this abstraction, which I call a filtering streambuf. In a filtering streambuf, the streambuf in question is not the ultimate sink or source, but simply processes the data and passes it on. In this way, it acts somewhat like a pipe in UNIX, but within the process. Anyone who has worked with UNIX is familiar with just how powerful this piping idiom can be.

Many different filtering streambufs are possible: on input, we can filter out comments, or merge continuation lines; on output, we can expand tabs, or insert timestamps. Ideally, we would like to write as little code as possible, which suggests some sort of generic class to handle the boilerplate, with only the actual filtering to be written for each use.

One small note: the iostream library has evolved somewhat in the standards committee, and different compilers ship different versions. For convenience, I've based the following code on the original release which accompanied CFront. This is partially because this corresponds to the version I use most often, and partially because it is more or less a lowest common denominator; implementations supporting the more modern features will generally also support the older ones. I've also ignored namespaces, largely for the same reasons. You may find that you need some adaptation to make my code work with your compiler. (All of the code has been tested with Sun CC 4.1, g++ 2.7.2 and Microsoft Visual C++ 5.0; with g++ and the Microsoft compiler, the old iostream library, <iostream.h>, has been used.)

Principles of writing streambufs

In this article, I will also try to explain more of the general principles of writing a streambuf. Judging from questions on the network, this is a little-known area of C++, so I cannot assume knowledge of even the basic principles.

To begin with, the abstraction behind the streambuf is that of a data sink and source. Istream and ostream take care of the formatting, and use a streambuf as the source or sink of individual characters. The class derived from streambuf takes care of any buffering, and of getting the characters to or from their final source or destination. Buffering within the streambuf is done in what are known as the get and the put areas; these will be explained as necessary in the implementation.

Both istream and ostream have a number of functions with which to interface with the streambuf. For the most part, these functions work on the get or the put area (the buffers). If there is no room in the put area to output a character, the virtual function overflow is called. If there are no characters in the get area when a read is attempted, the virtual function underflow is called. The key to writing a streambuf is in overriding these two functions. It is also generally necessary to override sync, and sometimes setbuf, and of course, if we wish to support random access, we also have to override seekpos and seekoff. (We will not consider random access here. Since the default behavior of these two functions is to return an error, we can ignore them.)
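To make the role of overflow concrete, here is a minimal sketch of a sink that does nothing but count characters. It is written against the standard library's std::streambuf rather than the classic <iostream.h> used in the rest of this article, and the names CountingStreambuf and countFormatted are invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <streambuf>

// Hypothetical example class: an unbuffered sink that only counts the
// characters sent to it.
class CountingStreambuf : public std::streambuf
{
public:
    CountingStreambuf() : myCount( 0 ) {}
    std::size_t count() const { return myCount ; }

protected:
    // No put area is ever set up, so every character written through the
    // public interface ends up in overflow.
    virtual int_type overflow( int_type ch )
    {
        if ( traits_type::eq_int_type( ch , traits_type::eof() ) )
            return traits_type::not_eof( ch ) ;     // nothing to flush
        ++ myCount ;
        return ch ;
    }

private:
    std::size_t         myCount ;
} ;

// Formats a few values through an ostream and reports how many characters
// reached the sink.
std::size_t countFormatted()
{
    CountingStreambuf   sink ;
    std::ostream        out( &sink ) ;
    out << "x = " << 42 << '\n' ;
    assert( out.good() ) ;
    return sink.count() ;
}
```

The ostream does all of the formatting; the streambuf only ever sees individual characters, which is exactly the division of labor described above.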

If the streambuf is to handle input and output simultaneously, we will also have to think about synchronization issues. In practice, filtering streambufs are almost always unidirectional, and we will only consider unidirectional buffers in this article.

A filtering streambuf for output

For various reasons, output is slightly simpler than input, so we will start with it. The template class will be called FilteringOutputStreambuf; it will be instantiated over a function-like object that can be called with a reference to a streambuf and a character to be output, and will return an int: EOF if the output failed, and any other value (typically the character written) otherwise.

First, the class definition. I'll describe the details as they are used in the functions later:

    template< class Inserter >
    class FilteringOutputStreambuf : public streambuf
    {
    public:
                            FilteringOutputStreambuf(
                                streambuf*     dest ,
                                Inserter       i ,
                                bool           deleteWhenFinished 
                                                    = false ) ;    
                            FilteringOutputStreambuf(
                                streambuf*     dest ,
                                bool           deleteWhenFinished 
                                                    = false ) ;
        virtual             ~FilteringOutputStreambuf() ;
        virtual int         overflow( int ch ) ;
        virtual int         underflow() ;
        virtual int         sync() ;
        virtual streambuf*  setbuf( char* p , int len ) ;

        inline Inserter&    inserter() ;

    private:
        streambuf*          myDest ;
        Inserter            myInserter ;
        bool                myDeleteWhenFinished ;
    } ;

I'll begin with the easiest functions. Although the recently adopted draft standard defines the default implementation of underflow to fail, this behavior was undefined in earlier implementations, so it is generally a good idea to override this function to simply return EOF -- any attempt to read from our output filtering streambuf should result in failure.

We generally want to be able to look at every character, so any buffering is out. To avoid buffering in the output streambuf, it is sufficient to never define a buffer. As for the function setbuf, I generally just pass it directly on to the final streambuf; this is the class which should be doing the actual buffering anyway. (Of course, since myDest is a pointer, I check for NULL before doing so.)

Sync is also surprisingly simple. Since we don't have any buffering, there is nothing to synchronize, so we just pass this one on to the final destination as well.

The function inserter is just for convenience -- I've never actually found a use for it on output, but its equivalent on input is sometimes useful, and I like to keep things orthogonal. It just returns a reference to the myInserter member.

Which leaves overflow:

    template< class Inserter >
    int
    FilteringOutputStreambuf< Inserter >::overflow( int ch )
    {
        int                 result( EOF ) ;
        if ( ch == EOF )
            result = sync() ;
        else if ( myDest != NULL )
        {
            assert( ch >= 0 && ch <= UCHAR_MAX ) ;
            result = myInserter( *myDest , ch ) ;
        }
        return result ;
    }

Although it wasn't ever specified, earlier implementations of iostream would flush the buffer if overflow was called with EOF, and some applications may count on this, so we want to do likewise. Other than that, we ensure that myDest isn't NULL before calling myInserter, and that's it.

Finally, there are the constructors and the destructor. In fact, there isn't much to say about them either; they simply initialize the obvious member variables. The only particularity is the Boolean flag to transfer ownership of the targeted streambuf; a convenience feature for the user, which also simplifies exception safety. In the destructor, I generally call sync, but it isn't necessary. And of course, if the user asked for it, I delete the final destination streambuf.
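Putting the pieces described so far together, the whole output class might look roughly like the sketch below. It is an adaptation, not the original code: it assumes the standard std::streambuf (so the destination is synchronized through pubsync), folds the two constructors into one with a default argument, and leaves out setbuf. TabExpandInserter and expandTabs are hypothetical names added to exercise it.

```cpp
#include <cassert>
#include <climits>      // UCHAR_MAX
#include <cstdio>       // EOF, NULL
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

template< class Inserter >
class FilteringOutputStreambuf : public std::streambuf
{
public:
    FilteringOutputStreambuf( std::streambuf* dest ,
                              Inserter        i = Inserter() ,
                              bool            deleteWhenFinished = false )
        :   myDest( dest ) , myInserter( i )
        ,   myDeleteWhenFinished( deleteWhenFinished )
    {
    }
    virtual ~FilteringOutputStreambuf()
    {
        sync() ;                        // optional, but harmless
        if ( myDeleteWhenFinished )
            delete myDest ;
    }
    Inserter& inserter() { return myInserter ; }

protected:
    // No put area is ever defined, so every character arrives here.
    virtual int overflow( int ch )
    {
        int                 result( EOF ) ;
        if ( ch == EOF )
            result = sync() ;
        else if ( myDest != NULL )
        {
            assert( ch >= 0 && ch <= UCHAR_MAX ) ;
            result = myInserter( *myDest , ch ) ;
        }
        return result ;
    }
    virtual int underflow() { return EOF ; }    // reading always fails
    virtual int sync()                          // nothing buffered here
    {
        return myDest != NULL ? myDest->pubsync() : 0 ;
    }

private:
    std::streambuf*     myDest ;
    Inserter            myInserter ;
    bool                myDeleteWhenFinished ;
} ;

// Hypothetical inserter: expands each tab to the next multiple of tabWidth.
class TabExpandInserter
{
public:
    explicit TabExpandInserter( int tabWidth = 8 )
        :   myTabWidth( tabWidth ) , myColumn( 0 )
    {
    }
    int operator()( std::streambuf& dst , int ch )
    {
        if ( ch == '\t' )
        {
            do
            {
                if ( dst.sputc( ' ' ) == EOF )
                    return EOF ;
                ++ myColumn ;
            } while ( myColumn % myTabWidth != 0 ) ;
            return ch ;
        }
        myColumn = (ch == '\n') ? 0 : myColumn + 1 ;
        return dst.sputc( static_cast< char >( ch ) ) ;
    }

private:
    int                 myTabWidth ;
    int                 myColumn ;
} ;

// Runs text through the filter and returns what reached the destination.
std::string expandTabs( std::string const& text , int tabWidth )
{
    std::stringbuf      dest ;
    FilteringOutputStreambuf< TabExpandInserter >
                        filter( &dest , TabExpandInserter( tabWidth ) ) ;
    std::ostream        out( &filter ) ;
    out << text ;
    return dest.str() ;
}
```

The inserter carries state (the current column), which is the case where passing an instance to the constructor matters.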

A simple use of this class might be to systematically insert a time stamp at the start of every line. (The operator() function is simpler than it looks; the only real work is in getting and formatting the time.)

    class TimeStampInserter
    {
    public:
                            TimeStampInserter()
                                :   myAtStartOfLine( true )
        {
        }

        int                 operator()( streambuf& dst , int ch )
        {
            bool                errorSeen( false ) ;
            if ( myAtStartOfLine && ch != '\n' )
            {
                time_t              t( time( NULL ) ) ;
                tm*                 time = localtime( &t ) ;
                char                buffer[ 128 ] ;
                int                 length(
                    strftime( buffer ,
                              sizeof( buffer ) ,
                              "%c: " ,
                              time ) ) ;
                assert( length > 0 ) ;
                if ( dst.sputn( buffer , length ) != length )
                    errorSeen = true ;
            }
            myAtStartOfLine = (ch == '\n') ;
            return errorSeen
                ?   EOF 
                :   dst.sputc( ch ) ;
        }

    private:
        bool                 myAtStartOfLine ;
    } ;

(In case you're wondering: I don't normally write functions in the class definition like this, but it seems easier for exposition to put everything in one place.)

In this case, there really isn't any reason to ever pass an Inserter argument to the FilteringOutputStreambuf constructor, since all freshly constructed instances of the class are identical. If the class didn't require state, you could write it as a function, and pass a pointer to the function as argument to the constructor of a FilteringOutputStreambuf< int (*)( streambuf& , int ) >. (I'd seriously consider using a typedef in this case. The above not only confuses human readers; it has confused more than one compiler I've tried it with as well.)
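As a sketch of the function-pointer variant with the suggested typedef: the class below is a cut-down stand-in for the template described above, again assuming the standard std::streambuf, and upperCaseInserter, InserterFnc, FunctionFilterStreambuf and toUpperViaFilter are names invented for the example.

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>       // EOF, NULL
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Compact stand-in for the output filter (overflow only).
template< class Inserter >
class FilteringOutputStreambuf : public std::streambuf
{
public:
    FilteringOutputStreambuf( std::streambuf* dest , Inserter i )
        :   myDest( dest ) , myInserter( i )
    {
        assert( dest != NULL ) ;
    }

protected:
    virtual int overflow( int ch )
    {
        if ( ch == EOF )
            return myDest->pubsync() ;
        return myInserter( *myDest , ch ) ;
    }

private:
    std::streambuf*     myDest ;
    Inserter            myInserter ;
} ;

// A stateless filter written as a plain function.
int upperCaseInserter( std::streambuf& dst , int ch )
{
    return dst.sputc( static_cast< char >( std::toupper( ch ) ) ) ;
}

// The typedefs keep the instantiation readable.
typedef int (*InserterFnc)( std::streambuf& , int ) ;
typedef FilteringOutputStreambuf< InserterFnc > FunctionFilterStreambuf ;

std::string toUpperViaFilter( std::string const& text )
{
    std::stringbuf          dest ;
    FunctionFilterStreambuf filter( &dest , &upperCaseInserter ) ;
    std::ostream            out( &filter ) ;
    out << text ;
    return dest.str() ;
}
```

Note that with a function pointer the inserter argument is no longer optional: a default-constructed pointer would be null.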

A filtering streambuf for input

Now we'll do the same thing for input; we'll call it FilteringInputStreambuf. This class is slightly more complicated than the output one, because of the interface definition of underflow: underflow does not extract a character from the input stream, it simply ensures that there is a character in the buffer. This in turn means that we cannot ignore the issue of buffering completely. Anyway, here's the class definition; the instantiation type must be callable with a reference to a streambuf, and return an int (either the character read or EOF):

    template< class Extractor >
    class FilteringInputStreambuf : public streambuf
    {
    public:
                            FilteringInputStreambuf(
                                streambuf*          source ,
                                Extractor           x ,
                                bool                deleteWhenFinished 
                                                        = false ) ;
                            FilteringInputStreambuf(
                                streambuf*          source ,
                                bool                deleteWhenFinished 
                                                        = false ) ;
        virtual             ~FilteringInputStreambuf() ;
        virtual int         overflow( int ) ;
        virtual int         underflow() ;
        virtual int         sync() ;
        virtual streambuf*  setbuf( char* p , int len ) ;

        inline Extractor&   extractor() ;

    private:
        streambuf*          mySource ;
        Extractor           myExtractor ;
        char                myBuffer ;
        bool                myDeleteWhenFinished ;
    } ;

As with output, we'll do the easy parts first: overflow is just an error (return EOF), and setbuf is forwarded. We could argue that sync should be either an error or forwarded, since it isn't supposed to do anything on an input stream. In fact, in our case, synchronization does have a meaning, since any characters in our local buffer have been extracted from the real input streambuf, but have not been read. The function extractor just returns a reference to the corresponding data member. Unlike the output side, this has a definite use: some of the filters may remove newline characters; in such cases, the extractor should maintain the correct line number from the source, and the user should access the extractor to obtain it for e.g. error messages.

Which brings us to the question of buffering: we need a buffer of at least one character in order to correctly support the semantics of underflow. These semantics are defined as they are in order to support non-extracting look-ahead, for parsing things like numbers, where you cannot know that you have finished until you have seen a character you don't want. To keep things simple, we maintain a one character buffer directly in the class: myBuffer.
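The look-ahead in question is easy to see at the streambuf level, using only standard calls: sgetc looks at the next character (calling underflow if the get area is empty) without extracting it, while sbumpc extracts. The function names parseUnsigned and lookAheadDemo below are made up for this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>       // EOF
#include <sstream>
#include <streambuf>
#include <string>

// Reads a run of decimal digits. The character that ends the number is
// looked at, but left in the streambuf for the next reader.
int parseUnsigned( std::streambuf& src )
{
    int                 value( 0 ) ;
    for ( int ch = src.sgetc() ;
          ch != EOF && std::isdigit( ch ) ;
          ch = src.sgetc() )
    {
        value = 10 * value + (ch - '0') ;
        src.sbumpc() ;                  // now extract the digit just used
    }
    return value ;
}

void lookAheadDemo()
{
    std::stringbuf      src( "123+4" ) ;
    assert( parseUnsigned( src ) == 123 ) ;
    assert( src.sbumpc() == '+' ) ;     // the terminator was not consumed
    assert( parseUnsigned( src ) == 4 ) ;
    assert( src.sgetc() == EOF ) ;
}
```

A filtering streambuf has to honor the same contract, which is why the character returned by underflow must stay available somewhere until it is actually extracted.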

Which gives us enough information to write underflow:

    template< class Extractor >
    int
    FilteringInputStreambuf< Extractor >::underflow()
    {
        int                 result( EOF ) ;
        if ( gptr() < egptr() )
            result = static_cast< unsigned char >( *gptr() ) ;  // never negative, so never mistaken for EOF
        else if ( mySource != NULL )
        {
            result = myExtractor( *mySource ) ;
            if ( result != EOF )
            {
                assert( result >= 0 && result <= UCHAR_MAX ) ;
                myBuffer = result ;
                setg( &myBuffer , &myBuffer , &myBuffer + 1 ) ;
            }
        }
        return result ;
    }

Several points are worth mentioning:

If a character is already in the buffer, we return it. (The draft standard guarantees that the function will not be called in such cases, but earlier specifications didn't.) Otherwise:
We call the user provided function to get the next character.
If we got one, we put it in our one character buffer, and call setg (a member function of streambuf) to set the pointers into the buffer.

I generally define sync to resynchronize with the actual source:

    template< class Extractor >
    int
    FilteringInputStreambuf< Extractor >::sync()
    {
        int                 result( 0 ) ;
        if ( mySource != NULL )
        {
            if ( gptr() < egptr() )
            {
                if ( mySource->sputbackc( *gptr() ) == EOF )
                    result = EOF ;      // success leaves result at 0
                setg( NULL , NULL , NULL ) ;
            }
            if ( mySource->sync() == EOF )
                result = EOF ;
        }
        return result ;
    }

If we have a character in our buffer, we send it back, and clear the buffer. We then sync with the original source, since it may be a FilteringInputStreambuf as well.

The constructors are just simple initialization; the destructor adds a call to sync, and deletion of the source if requested.
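Assembled into one piece, the input class could look something like the following sketch. As before, this is an adaptation to the standard std::streambuf rather than the original code: the source is synchronized through pubsync, the two constructors are folded into one, and setbuf and overflow are left out (the standard default for overflow already fails). PassThroughExtractor and peekThenReadSource are hypothetical, and only serve to show sync handing the buffered character back.

```cpp
#include <cassert>
#include <climits>      // UCHAR_MAX
#include <cstdio>       // EOF, NULL
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

template< class Extractor >
class FilteringInputStreambuf : public std::streambuf
{
public:
    FilteringInputStreambuf( std::streambuf* source ,
                             Extractor       x = Extractor() ,
                             bool            deleteWhenFinished = false )
        :   mySource( source ) , myExtractor( x )
        ,   myDeleteWhenFinished( deleteWhenFinished )
    {
    }
    virtual ~FilteringInputStreambuf()
    {
        sync() ;
        if ( myDeleteWhenFinished )
            delete mySource ;
    }
    Extractor& extractor() { return myExtractor ; }

protected:
    virtual int underflow()
    {
        int                 result( EOF ) ;
        if ( gptr() < egptr() )
            result = static_cast< unsigned char >( *gptr() ) ;
        else if ( mySource != NULL )
        {
            result = myExtractor( *mySource ) ;
            if ( result != EOF )
            {
                assert( result >= 0 && result <= UCHAR_MAX ) ;
                myBuffer = static_cast< char >( result ) ;
                setg( &myBuffer , &myBuffer , &myBuffer + 1 ) ;
            }
        }
        return result ;
    }
    virtual int sync()
    {
        int                 result( 0 ) ;
        if ( mySource != NULL )
        {
            if ( gptr() < egptr() )     // give back the unread character
            {
                if ( mySource->sputbackc( *gptr() ) == EOF )
                    result = EOF ;
                setg( NULL , NULL , NULL ) ;
            }
            if ( mySource->pubsync() == EOF )
                result = EOF ;
        }
        return result ;
    }

private:
    std::streambuf*     mySource ;
    Extractor           myExtractor ;
    char                myBuffer ;
    bool                myDeleteWhenFinished ;
} ;

// Hypothetical extractor that passes every character through unchanged.
struct PassThroughExtractor
{
    int operator()( std::streambuf& src ) { return src.sbumpc() ; }
} ;

// Looks ahead one character through the filter, resynchronizes, and then
// reads the first line directly from the source.
std::string peekThenReadSource( std::string const& text )
{
    std::stringbuf      source( text ) ;
    FilteringInputStreambuf< PassThroughExtractor >
                        filter( &source ) ;
    std::istream        in( &filter ) ;
    int                 first( in.peek() ) ;    // fills the one char buffer
    assert( first != EOF ) ;
    in.sync() ;                                 // hands it back to source
    std::istream        direct( &source ) ;
    std::string         line ;
    std::getline( direct , line ) ;
    return line ;
}
```

Without the call to sync, the character pulled in by peek would be missing when the source is read directly.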

I tend to use this class much more often than the output. One simple example: stripping end of line comments:

    class UncommentExtractor
    {
    public:
                            UncommentExtractor( char commentChar = '#' )
                                :   myCommentChar( commentChar )
        {
        }

        int                 operator()( streambuf& src )
        {
            int                 ch( src.sbumpc() ) ;
            if ( ch == myCommentChar )
            {
                while ( ch != EOF && ch != '\n' )
                    ch = src.sbumpc() ;
            }
            return ch ;
        }

    private:
        char                myCommentChar ;
    } ;
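The extractor function earns its keep with filters that swallow newlines. The sketch below merges continuation lines (a backslash immediately followed by a newline) and keeps count of the lines consumed from the source, so that the caller can still report source line numbers. The input class is a cut-down stand-in based on std::streambuf, and ContinuationExtractor, lineNumber and readLogicalLine are names invented for this example.

```cpp
#include <cassert>
#include <cstdio>       // EOF, NULL
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Compact stand-in for the input filter (underflow and extractor only).
template< class Extractor >
class FilteringInputStreambuf : public std::streambuf
{
public:
    explicit FilteringInputStreambuf( std::streambuf* source )
        :   mySource( source ) , myExtractor()
    {
        assert( source != NULL ) ;
    }
    Extractor& extractor() { return myExtractor ; }

protected:
    virtual int underflow()
    {
        if ( gptr() < egptr() )
            return static_cast< unsigned char >( *gptr() ) ;
        int                 result( myExtractor( *mySource ) ) ;
        if ( result != EOF )
        {
            myBuffer = static_cast< char >( result ) ;
            setg( &myBuffer , &myBuffer , &myBuffer + 1 ) ;
        }
        return result ;
    }

private:
    std::streambuf*     mySource ;
    Extractor           myExtractor ;
    char                myBuffer ;
} ;

// Hypothetical extractor: joins lines ending in a backslash, and counts
// every newline actually consumed from the source.
class ContinuationExtractor
{
public:
    ContinuationExtractor() : myLineNumber( 1 ) {}
    int operator()( std::streambuf& src )
    {
        int                 ch( src.sbumpc() ) ;
        while ( ch == '\\' && src.sgetc() == '\n' )
        {
            src.sbumpc() ;              // drop the newline
            ++ myLineNumber ;
            ch = src.sbumpc() ;         // first character of the next line
        }
        if ( ch == '\n' )
            ++ myLineNumber ;
        return ch ;
    }
    int lineNumber() const { return myLineNumber ; }

private:
    int                 myLineNumber ;
} ;

// Reads the first logical line of text into line, and returns the source
// line number at which the following line starts.
int readLogicalLine( std::string const& text , std::string& line )
{
    std::stringbuf      source( text ) ;
    FilteringInputStreambuf< ContinuationExtractor >
                        filter( &source ) ;
    std::istream        in( &filter ) ;
    std::getline( in , line ) ;
    return filter.extractor().lineNumber() ;
}
```

The reader of the filtered stream sees one line, while the extractor knows that two source lines went into it.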

Simplifying use

With the above, we can already do everything necessary. Still, it is often extra work to have to declare the streambuf and the istream or ostream separately. So it is convenient to also define the corresponding template classes for istream and ostream. Here's the class definition for FilteringIstream; FilteringOstream follows the same pattern:

    template< class Extractor >
    class FilteringIstream 
        :   private FilteringInputStreambuf< Extractor >
        ,   public istream
    {
    public:
                            FilteringIstream( istream& source ,
                                              Extractor x ) ;
                            FilteringIstream( istream& source ) ;
                            FilteringIstream( 
                                streambuf*          source ,
                                Extractor           x ,
                                bool                deleteWhenFinished
                                                        = false ) ;
                            FilteringIstream( 
                                streambuf*          source ,
                                bool                deleteWhenFinished
                                                        = false ) ;
        virtual             ~FilteringIstream() ;

        FilteringInputStreambuf< Extractor >*
                            rdbuf() ;
    } ;

The somewhat unusual inheritance is a trick I learned from Dietmar Kühl; it serves to ensure that the streambuf is fully initialized before its address is passed to the constructor of istream, without having to allocate it dynamically on the heap. It's also worth noting the constructors taking an istream&, instead of a streambuf*; again, just a convenience, but it means that you can pass cin directly as an argument, rather than having to use cin.rdbuf(). (The call to rdbuf is still there, of course, in the initialization list of the constructor.) And if another istream is using the streambuf, you certainly don't want to delete it, so these constructors drop that parameter completely.
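A sketch of how the inheritance trick plays out, once more assuming the standard std::istream and std::streambuf, with a cut-down input filter and only the istream& constructor: base classes are constructed in declaration order, so the private streambuf base is complete by the time istream receives its address. The typedef Buffer and the helper wordsWithoutComments are additions for the example.

```cpp
#include <cassert>
#include <cstdio>       // EOF, NULL
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Compact stand-in for the input filter (underflow only).
template< class Extractor >
class FilteringInputStreambuf : public std::streambuf
{
public:
    explicit FilteringInputStreambuf( std::streambuf* source )
        :   mySource( source ) , myExtractor()
    {
        assert( source != NULL ) ;
    }

protected:
    virtual int underflow()
    {
        if ( gptr() < egptr() )
            return static_cast< unsigned char >( *gptr() ) ;
        int                 result( myExtractor( *mySource ) ) ;
        if ( result != EOF )
        {
            myBuffer = static_cast< char >( result ) ;
            setg( &myBuffer , &myBuffer , &myBuffer + 1 ) ;
        }
        return result ;
    }

private:
    std::streambuf*     mySource ;
    Extractor           myExtractor ;
    char                myBuffer ;
} ;

template< class Extractor >
class FilteringIstream
    :   private FilteringInputStreambuf< Extractor >
    ,   public std::istream
{
    typedef FilteringInputStreambuf< Extractor > Buffer ;

public:
    explicit FilteringIstream( std::istream& source )
        :   Buffer( source.rdbuf() )    // declared first, so built first
        ,   std::istream( static_cast< Buffer* >( this ) )
    {
    }
    Buffer* rdbuf() { return this ; }
} ;

// The comment stripping extractor from the text, with std:: added.
class UncommentExtractor
{
public:
    UncommentExtractor( char commentChar = '#' )
        :   myCommentChar( commentChar )
    {
    }
    int operator()( std::streambuf& src )
    {
        int                 ch( src.sbumpc() ) ;
        if ( ch == myCommentChar )
        {
            while ( ch != EOF && ch != '\n' )
                ch = src.sbumpc() ;
        }
        return ch ;
    }

private:
    char                myCommentChar ;
} ;

// Collects the whitespace separated words of text, comments removed.
std::string wordsWithoutComments( std::string const& text )
{
    std::istringstream  source( text ) ;
    FilteringIstream< UncommentExtractor >
                        input( source ) ;
    std::string         result ;
    std::string         word ;
    while ( input >> word )
    {
        if ( ! result.empty() )
            result += ' ' ;
        result += word ;
    }
    return result ;
}
```

Had the streambuf been a data member instead, it would have been constructed after the istream base, which is precisely the ordering problem the private base avoids.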

With all this, if you want to read standard input, ignoring end of line comments, all you need is the UncommentExtractor shown above, and the following definition:

    FilteringIstream< UncommentExtractor >
                        input( cin ) ;

That's all there is to it.

Conclusions

As you have seen, creating your own streambufs can be a powerful idiom. And we've only scratched the surface of the possibilities. The complete code for all of the classes discussed in this article, along with a number of additional inserters and extractors, can be downloaded from this site, so you can try it yourself.

It would be unfair if I tried to take credit for the entire concept. First and foremost, if Jerry Schwarz hadn't come up with the original idea of separating the sinking and sourcing of the data from the formatting, none of this would have been possible. And most of what I know about iostream, I learned from contributors in the C++ newsgroups, particularly Steve Clamage, who has always taken the time to answer most of the serious questions posed there. More recently, people like Dietmar Kühl have been pursuing similar paths of research.

Finally, I owe particular thanks to the customer site at which I first applied this technique, the LTS division of Alcatel SEL, in Stuttgart, and to my boss there, Ömer Oskay. The freedom they gave me to pursue new ways of doing things was amazing, and while I like to think that it always paid off for them in the end, it certainly wasn't always obvious beforehand that it would. Without their confidence in me, most of this work would not have been possible.