From: Carlo Wood (carlo_at_[hidden])
Date: 2004-08-31 16:55:46

On Tue, Aug 31, 2004 at 12:45:30PM -0600, Jonathan Turkanis wrote:
> pubsync is called, the contents of the buffer will be sent to the first filter
> in the chain, like so (assuming it's a 'buffered filter'):
> filter_.write(buf, buf + n);
> Yes, for the time being. If your ideas can eliminate copying further, I'd be
> glad to try to incorporate them. (But I haven't looked at your library yet.)

My idea then would involve the introduction of a 'message' object,
something that abstracts a contiguous piece of data with a finite
size that can be processed as a unit. For example, one line of
text in the case of text filters - or one packet of data when
processing a UDP stream - or one binary packet that starts with
an envelope/header followed by a payload etc.

Then, instead of passing (buf, buf + n), this more abstract 'message'
object should be used. The message object would contain the
'buf' pointer and the size 'n' - not the complete data of course.
Purely for exposition:

struct Message {
  char* buf;
  size_t n;
};

A filter should then be allowed to do the following things with this
message:

1) Tell it that the data can be freed.
   If the data is still in the original streambuf then the
   message object would take care of telling the streambuf
   that the part it was holding is now free again.

2) Process it inline - it would not write outside the buffer
   but only examine it and change things perhaps such that
   the result still fits in the same buffer.

3) Copy the data to a newly allocated memory block (which now
   can be larger than the original), filtering it while copying
   it if needed. This means that the 'message object' tells
   the streambuf that the data is now freed. Subsequent
   'freeing' of the message would now delete the allocated
   memory block and not that of the streambuf.

To the user of the 'message' only this interface would be
visible (for example):

Message::start() const : Get the start of the message.
Message::size() const : Get the size of the message.
Message::reserved() const : Size of the allocated buffer.
Message::reserve(size) : Increase buffer size (possibly causing a copy).
Message::set_size(size) : Set a new message size.
Message::~Message() : Free the underlying data and destruct the message object.
Message::Message(size) : Create a new Message object with an uninitialized
                                  buffer of size 'size'.
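
Purely for exposition again, a minimal sketch of how such an interface
could look. The second constructor and the 'release_fn' callback are
assumptions that stand in for the real streambuf bookkeeping; they are
not part of any existing library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sketch of the Message interface listed above.
// The streambuf interaction is reduced to a callback (an assumption).
class Message {
public:
    // Called to tell the streambuf that [buf, buf + n) is free again.
    typedef void (*release_fn)(char* buf, std::size_t n);

    // Create a message that owns an uninitialized buffer of 'size' bytes.
    explicit Message(std::size_t size)
        : buf_(new char[size]), size_(size), reserved_(size), release_(0) {}

    // Wrap data that still lives inside a streambuf (assumed constructor).
    Message(char* buf, std::size_t n, std::size_t reserved, release_fn release)
        : buf_(buf), size_(n), reserved_(reserved), release_(release) {}

    // Free the underlying data, wherever it currently lives.
    ~Message() { free_data(); }

    char* start() const { return buf_; }
    std::size_t size() const { return size_; }
    std::size_t reserved() const { return reserved_; }

    // Set a new message size; it must fit in the allocated buffer.
    void set_size(std::size_t size) {
        assert(size <= reserved_);
        size_ = size;
    }

    // Increase the buffer size. Copies only when the buffer is too small;
    // after a copy the message owns a heap block and the streambuf part
    // has been released.
    void reserve(std::size_t size) {
        if (size <= reserved_) return;
        char* bigger = new char[size];
        std::memcpy(bigger, buf_, size_);
        free_data();
        buf_ = bigger;
        reserved_ = size;
        release_ = 0;
    }

private:
    Message(const Message&);             // not copyable in this sketch
    Message& operator=(const Message&);

    void free_data() {
        if (release_) release_(buf_, reserved_);
        else delete[] buf_;
    }

    char* buf_;
    std::size_t size_;
    std::size_t reserved_;
    release_fn release_;
};
```

With this layout, case 1) corresponds to the destructor, case 2) to
working through start() and set_size(), and case 3) to reserve().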

The call to a filter would then become:

           filter_.write(message); // Passing a Message

The reason that this is not a trivial change is mostly because
the streambuf must be aware of the existence of these Message objects.
If you seriously consider going for this approach then I am
willing to donate my dbstreambuf code.

Filters that can be implemented without the need to increase
the message size can then always work 'in place', without the
need for unnecessary copying.
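
As an illustration of such an 'in place' filter, here is a small sketch
that uses the exposition-only struct with 'buf' and 'n'. The filter
name 'to_upper_filter' is made up for this example; any filter whose
output is never larger than its input could be written the same way.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>

// Exposition-only message: it refers to the data, it does not hold a copy.
struct Message {
    char* buf;
    std::size_t n;
};

// Hypothetical filter whose output has the same size as its input,
// so it can overwrite the message where it lies.
struct to_upper_filter {
    void write(Message& msg) const {
        assert(msg.buf != 0 || msg.n == 0);
        for (std::size_t i = 0; i < msg.n; ++i) {
            unsigned char c = static_cast<unsigned char>(msg.buf[i]);
            msg.buf[i] = static_cast<char>(std::toupper(c));
        }
    }
};
```

No allocation and no copy takes place here: the filter only touches
the bytes between buf and buf + n.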

Filters that need to enlarge a buffer also do not always have
to copy the data; when the message buffer is already large enough
then no copying is needed. For example, to transform a compressed
UNIX text file to a compressed windows text file:

file >> expand_msg(2000) >> decompress >> add_cariage_return >> compress >> file

Only the first filter would copy the data (it would call new char [2000] and
copy the real message, which can be much smaller - leaving the
rest of the buffer uninitialized). decompress then would not have to
allocate new space - and neither would 'add_cariage_return' etc.
[ However, this still isn't satisfactory because a decompress filter will
ALWAYS have to copy the data. Better would be to be able to pass a
size to the decompress filter:
file >> decompress(2000) >> add_cariage_return >> compress >> file
or, just tell the decompress filter that it should try to make
the resulting message have a buffer that is at least 1 character
larger than the size of the resulting message:
file >> decompress(1) >> add_cariage_return >> compress >> file
Then really only a single copy is needed. On the other hand, the
first is also already advantageous in that only a single allocation
is needed: malloc is slow too *).]
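
To show why spare room in the buffer avoids the copy, here is a sketch
of how the carriage-return filter could grow a message where it lies.
The 'reserved' field and the free function below are assumptions made
for this example, and the input is assumed to contain bare '\n' line
endings only.

```cpp
#include <cassert>
#include <cstddef>

// Exposition-only message with an extra (assumed) capacity field.
struct Message {
    char* buf;
    std::size_t n;         // bytes in use
    std::size_t reserved;  // bytes allocated, reserved >= n
};

// Turn every "\n" into "\r\n" inside the existing buffer.
// Returns false, leaving the message untouched, when the buffer has too
// little spare room; the caller would then have to enlarge it first.
bool add_carriage_return(Message& msg) {
    std::size_t newlines = 0;
    for (std::size_t i = 0; i < msg.n; ++i)
        if (msg.buf[i] == '\n') ++newlines;

    std::size_t new_size = msg.n + newlines;
    if (new_size > msg.reserved) return false;

    // Work from the back so that no unread byte is overwritten.
    std::size_t src = msg.n;
    std::size_t dst = new_size;
    while (src > 0) {
        char c = msg.buf[--src];
        msg.buf[--dst] = c;
        if (c == '\n') msg.buf[--dst] = '\r';
    }
    assert(dst == 0);
    msg.n = new_size;
    return true;
}
```

A decompress filter that leaves enough slack behind the message would
let this function succeed without any further allocation.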

Carlo Wood <carlo_at_[hidden]>
*) Which seems to indicate that the Message object should
   have an Allocator template parameter (i.e., to implement
   memory pools).
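
A rough sketch of what that could look like. The name 'BasicMessage'
and the defaulted std::allocator<char> are assumptions for the sake of
the example; a pool allocator with the same allocate/deallocate
members could be plugged in instead.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical Message variant that gets its buffer from an Allocator.
template <class Allocator = std::allocator<char> >
class BasicMessage {
public:
    // Create a message with an uninitialized buffer of 'size' bytes.
    explicit BasicMessage(std::size_t size,
                          const Allocator& alloc = Allocator())
        : alloc_(alloc), buf_(alloc_.allocate(size)),
          size_(size), reserved_(size) {}

    // Hand the buffer back to the allocator (for example a memory pool).
    ~BasicMessage() { alloc_.deallocate(buf_, reserved_); }

    char* start() const { return buf_; }
    std::size_t size() const { return size_; }
    std::size_t reserved() const { return reserved_; }

    void set_size(std::size_t size) {
        assert(size <= reserved_);
        size_ = size;
    }

private:
    BasicMessage(const BasicMessage&);             // not copyable here
    BasicMessage& operator=(const BasicMessage&);

    Allocator alloc_;       // declared first: buf_ is initialized from it
    char* buf_;
    std::size_t size_;
    std::size_t reserved_;
};
```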
