
From: pedro.lamarao_at_[hidden]
Date: 2005-06-14 07:08:51


Scott Woods wrote:

> 3. Lastly, some of the observations (while excellent) seem a bit "macro"
> when a wider view might lead to different thinking. What I am trying to
> suggest here is that the time spent in vector<>::resize is truly surprising,
> but it's also very low-level. Having been through 4 recent refactorings of a
> network framework, I have been surprised at the gains made in other
> areas by conceding, say, byte-level processing in another area.

I understand that these difficulties are orthogonal to any IOStreams
issues, in the sense that everyone obtaining data from an unknown source
must deal with them.

This "buffering problem" is the problem that leads people to design
protocols with fixed sizes everywhere.

> To make more of a case around the last point, consider the packetizing,
> parsing and copying that's recently been discussed. This has been related
> to the successful recognition of a message on the stream.
>
> Is it acknowledged that a message is an arbitrarily complex data
> object? By the time an application is making use of the "fields" within
> a message, that's probably a reasonable assumption. So at some point
> these fields must be "broken out" of the message. Or parsed. Or
> marshalled. Or serialized. Is the low-level packet (with the length
> header and body) being scanned again? What copying is being done?
> This seems like multi-pass to me.
>
> To get to the point: I am currently reading blocks off network connections
> and presenting them to byte-by-byte lexer/parser routines. These form
> the structured network messages directly, i.e. fields are already plucked
> out.
>
> So which is better? Direct byte-by-byte conversion to structured network
> message or multi-pass?

If I understood you correctly, I might rephrase that to myself as: do we
read the whole message before parsing, or do we parse directly from
the data source?

If we parse directly from the data source, we must analyze byte by byte,
and so obtain byte by byte. If we want this, we will want a buffering
layer to keep the number of system calls at a reasonable level.

streambufs provide such a buffering layer, with I/O operations suited to
lexical analysis at that level: sgetc, snextc, sbumpc.
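
To illustrate, a minimal sketch of that style, assuming the protocol
field being lexed is a run of decimal digits:

    #include <cctype>
    #include <streambuf>

    // Lex an unsigned decimal integer directly from a streambuf, one
    // byte at a time: sgetc peeks at the current byte, snextc consumes
    // it and peeks at the next. The streambuf's internal buffer keeps
    // actual system calls rare.
    unsigned long parse_digits(std::streambuf& sb)
    {
        typedef std::streambuf::traits_type traits;
        unsigned long value = 0;
        int c = sb.sgetc();
        while (c != traits::eof() && std::isdigit(c))
        {
            value = value * 10 + (c - '0');
            c = sb.snextc();
        }
        return value;
    }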

If you remember that streambuf_iterators exist, and imagine a multi_pass
iterator (hint, hint), many other interesting things come to mind.
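
For instance, a sketch of line reading through the standard
istreambuf_iterator; anything written against input iterators (a Spirit
parser over multi_pass, say) could consume the stream the same way:

    #include <iterator>
    #include <streambuf>
    #include <string>

    // Consume one line from a streambuf through an input iterator.
    std::string read_line(std::streambuf& sb)
    {
        std::istreambuf_iterator<char> it(&sb), end;
        std::string line;
        while (it != end && *it != '\n')
            line += *it++;
        if (it != end)
            ++it;            // discard the '\n' itself
        return line;
    }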

If we read the message completely beforehand, we must know how much we
have to read, or we must inspect the data source in some way to watch
for "end of message".

If we have control over the protocol design, we can make it "fixed
size". Making it "fixed size" would mean prefixing a "payload" with size
information. If that size information is itself fixed in size, then we're
set. If not, we'll have to parse at least this prefix on the fly.
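
A minimal sketch of that scheme, with a hypothetical read_exactly
helper standing in for whatever loop pulls exactly n bytes from the
source:

    #include <cstddef>
    #include <vector>

    // Hypothetical helper: blocks until exactly n bytes have arrived.
    void read_exactly(char* dst, std::size_t n);

    std::vector<char> read_message()
    {
        // The prefix itself is fixed size, so nothing needs to be
        // parsed on the fly beyond decoding these four bytes.
        unsigned char prefix[4];
        read_exactly(reinterpret_cast<char*>(prefix), 4);

        std::size_t n = (std::size_t(prefix[0]) << 24)
                      | (std::size_t(prefix[1]) << 16)
                      | (std::size_t(prefix[2]) <<  8)
                      |  std::size_t(prefix[3]);

        std::vector<char> payload(n);    // one allocation, no resizing
        if (n)
            read_exactly(&payload[0], n);
        return payload;
    }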

I've seen at least one protocol naive enough to throw an int as the
"prefix" directly into the data sink. Luckily, every machine involved
was an Intel x86 running some Windows server.
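
For contrast, a sketch of the portable way to build such a prefix,
serializing byte by byte instead of copying a raw int's object
representation (boost::uint32_t assumed, from <boost/cstdint.hpp>):

    #include <boost/cstdint.hpp>

    // Encode a 32-bit length as four big-endian bytes; the result does
    // not depend on the host's byte order or on sizeof(int).
    void encode_prefix(boost::uint32_t n, unsigned char out[4])
    {
        out[0] = static_cast<unsigned char>(n >> 24);
        out[1] = static_cast<unsigned char>(n >> 16);
        out[2] = static_cast<unsigned char>(n >>  8);
        out[3] = static_cast<unsigned char>(n);
    }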

If we don't have control over the protocol design, we can apply another
layer, encapsulating the protocol message inside a "control message"
providing the fixed sizes we need. The previous considerations would
then apply to this control message. This has been suggested here many times.

So, after reading every byte we need, we'd start parsing over the
sequence in memory, instead of the "sequence" from the data source.

streambufs provide for this, with the sgetn operation, and even the
possibility of shutting buffering down entirely.
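
A sketch of that side, assuming the size is already known:

    #include <cstddef>
    #include <ios>
    #include <streambuf>
    #include <vector>

    // Pull a whole payload of known size in one sgetn call; parsing
    // then runs over the in-memory sequence, not over the data source.
    std::vector<char> read_payload(std::streambuf& sb, std::streamsize n)
    {
        std::vector<char> buf(static_cast<std::size_t>(n));
        std::streamsize got = n ? sb.sgetn(&buf[0], n) : 0;
        buf.resize(static_cast<std::size_t>(got)); // source may end early
        return buf;
    }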

At this point, we have read the same number of bytes from the data
source, in whatever order. But the number of calls made to the I/O system
service is not the same, and the fixed size approach is more efficient
in this regard.

Also, the fixed size approach solves the "buffering problem", since we
make no resizings along the way. C++ people, blessed with std::vector,
already have a mechanism to do away with such weirdness; think about
how you would do it in C.
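
For contrast, a sketch of the grow-on-demand dance that std::vector
hides, in the C style alluded to (error handling omitted):

    #include <cstdlib>
    #include <cstring>

    // Append n bytes to a heap buffer, doubling its capacity on
    // demand; the manual version of what vector does on every resize.
    char* append(char* buf, std::size_t& cap, std::size_t used,
                 const char* src, std::size_t n)
    {
        if (used + n > cap)
        {
            while (used + n > cap)
                cap = cap ? cap * 2 : 64;
            buf = static_cast<char*>(std::realloc(buf, cap));
        }
        std::memcpy(buf + used, src, n);
        return buf;
    }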

But such a design suffers elsewhere. Let me argue a little against it.

First. We, unfortunately, can't pass a std::vector to the operating
system, so, at some point, we are allocating fixed sized buffers and
passing them to our I/O primitives. There is no escape.

If you are initializing std::vector with the correct size and giving
&*begin() to these primitives, well... Why not allocate with new? If you
are allocating it with whatever default size and resizing it later, you
are losing part of the proposed benefit.
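
The pattern in question, as a sketch; io_read is a hypothetical
stand-in for whatever primitive the system offers:

    #include <cstddef>
    #include <vector>

    // Hypothetical primitive: reads up to n bytes, returns the count.
    std::size_t io_read(char* dst, std::size_t n);

    std::vector<char> recv_block(std::size_t size)
    {
        std::vector<char> buf(size);   // sized correctly from the start
        if (size)
        {
            std::size_t got = io_read(&*buf.begin(), buf.size());
            buf.resize(got);           // trim to what actually arrived
        }
        return buf;
    }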

When we're about to throw a message at the network, how do we know what
size it is? If our message is composed of, say, a string, another string
and an int, are we going to call string::size() twice for every message
dumped? Is the int representation fixed in size? Is that size enough for
INT_MAX?
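
A sketch of the bookkeeping those questions imply, assuming each string
carries its own length prefix and the int is given a fixed four-byte
encoding, so its size never depends on its value:

    #include <cstddef>
    #include <string>

    // Total encoded size of a (string, string, int) message; every
    // field's length must be known before a byte reaches the network.
    std::size_t wire_size(const std::string& a, const std::string& b)
    {
        return 4 + a.size()    // length prefix + bytes of a
             + 4 + b.size()    // length prefix + bytes of b
             + 4;              // the int, fixed at four bytes
    }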

HTTP is like that; the size of the resource being sent to the client is
present in a header field. If you think that is easy because HTTP
servers can get that from the filesystem, get acquainted with server-side
scripting, and think again. HTTP headers, on the other hand, must
be parsed at least a little, to look for "end of header". SMTP clients
try to hint at the size of the data being sent, but that is not
authoritative. There are also "end of data" marks in SMTP.
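
Even with a Content-Length in hand, the receiver still scans for the
blank line that ends the header block; a trivial sketch:

    #include <string>

    // True once the bytes received so far contain a complete HTTP
    // header block, i.e. the empty line that terminates it.
    bool header_complete(const std::string& received)
    {
        return received.find("\r\n\r\n") != std::string::npos;
    }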

Take a more complicated protocol like the Jabber protocol, whose
messages are effectively XML nodes. Are we going to traverse the tree to
find out the payload size? If we have an XML processor, why not apply it
directly to the network? Check out that protocol to see how powerful the
message format is before complaining about "weird protocols that only
bring more trouble".

We don't need to go that far, of course. Mail messages today are already
trees of MIME parts. SMTP makes no guarantee that the SIZE hint will be
respected. SIZE hints may not even be present. What will the server do?

I've seen an SMTP proxy service, whose job was to transform a mail
message on the fly before it hit the SMTP server, suffer badly from
this. That proxy won't be sending any SIZE hints.

My point is: a generic networking library, written for generic client
code dealing with generic protocols, must provide for every kind of
model. Impose a new layer of headers, and you're not generic anymore.
Force everyone over a buffer, and you're not generic anymore. Put
everything over asynchronicity, and you're not generic anymore.
(Please, think of us regular, blocking, thread-loving people, I beg you!)

And this is all about getting and putting stuff to the network, without
considering whatever character encoding conversions must be done from
one place to another, which could perfectly well increase or decrease
the size in bytes of a given representation.

Also, think of a system with ISO-8859-1 default locale; what do you do
when you want to get a web page from a server providing UTF-8 pages?
Good luck with that. Those dealing exclusively with in-house designed
protocols live in heaven, safe from this kind of thing.

If you are on the Internet, you have very few guarantees. It's hell
out here, sir.

--
 Pedro Lamarão
