Subject: [Boost-users] Delimiting protocol messages (was [asio] read_some() splits data)
From: Marsh Ray (marsh_at_[hidden])
Date: 2011-05-12 13:10:50


On 05/11/2011 09:36 AM, Andrew Holden wrote:
> On Wednesday, May 11, 2011 9:57 AM, Slav wrote:
>> Then messages of length not multiple of 128 (BUFFER_SIZE)
>> will not be read - tested it, anyway, after
>> last socket::read_some() (with readBytes> 0&& <
>> BUFFER_SIZE) next socket::read_some() never ends.
>
> It would probably help to understand that TCP has no concept of a
> "message". Anything you write to a socket is appended to a stream of
> *bytes*.

Alternatively, we could say that TCP, in fact, does have a well-defined
concept of messages: they are all exactly one byte long.

> Several subsystems on both the sending and receiving computers

... and sometimes boxes in the middle ...

> will have the option of splitting and combining adjacent buffers with no
> consideration to how big each individual write was.

I've done a little protocol stuff with ASIO now and I must say it's a
lot of fun and I can't go back to doing it any other way.

> On further thought, I think I see the problem (and I apologize for the
> bad recommendation in my last email). Your sender somehow needs to
> communicate the message size or flag the end of the message. A partial
> list of options includes:

The pattern I encounter over and over again (often at multiple levels in
a protocol) is:

class protocol_layer_context
{
    std::vector<std::uint8_t> buffer;

    void on_received_data(const std::vector<std::uint8_t> & rx_bytes)
    {
        // append the new bytes to whatever is still pending
        buffer.insert(buffer.end(), rx_bytes.begin(), rx_bytes.end());

        // perhaps a virtual override.
        size_t msg_len = this->parse_len_from_start_of_buffer();

        // a message is complete once at least msg_len bytes are buffered
        if (buffer.size() >= msg_len)
        {
            std::vector<std::uint8_t> msg_buf =
                consume_data_from_front_of_buffer(buffer, msg_len);

            // perhaps a virtual override
            this->process_complete_message(msg_buf);
        }

        // post another ASIO read request
        this->request_more_data();
    }
...

But there are some important issues with this naive pseudocode:

1. It can result in recopying the data a bunch of times for every
protocol layer, killing performance.

2. It's susceptible to denial-of-service (DoS). A bad guy can trick you
into allocating all your memory.

3. Sometimes the length of a message is stated at the beginning of the
message, sometimes it isn't known until the end.

4. No processing happens on the message until it's completely read, but
some protocols really need the receiving endpoint to process it
incrementally.

5. Error handling

6. Optimal threading

7. Etc.
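
To make issues 2 and 5 concrete, here is a minimal sketch of a framer
that caps the message length and reports a violation as an error. The
wire format (a 4-byte big-endian length followed by the payload) is an
assumption made for the example, and the names length_prefixed_framer,
max_msg_len and on_message are hypothetical. It still copies each
payload once, so it does nothing about issue 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Assumed wire format for this sketch: every message is a 4-byte
// big-endian length followed by exactly that many payload bytes.
constexpr std::size_t header_len = 4;

class length_prefixed_framer
{
public:
    using bytes = std::vector<std::uint8_t>;
    using handler = std::function<void(const bytes&)>;

    length_prefixed_framer(std::size_t max_msg_len, handler on_message)
        : max_msg_len_(max_msg_len), on_message_(std::move(on_message))
    {
        assert(on_message_);  // a handler is required
    }

    // Feed the bytes returned by one read, however the stream was split.
    // Throws std::length_error when the peer announces an oversized
    // message; the caller is expected to drop the connection then.
    void on_received_data(const std::uint8_t* data, std::size_t n)
    {
        buffer_.insert(buffer_.end(), data, data + n);

        std::size_t pos = 0;  // index of the first unconsumed byte
        while (buffer_.size() - pos >= header_len)
        {
            const std::size_t msg_len =
                (std::size_t(buffer_[pos]) << 24) |
                (std::size_t(buffer_[pos + 1]) << 16) |
                (std::size_t(buffer_[pos + 2]) << 8) |
                std::size_t(buffer_[pos + 3]);

            // Reject the announced length before waiting for the payload.
            if (msg_len > max_msg_len_)
                throw std::length_error("announced length exceeds limit");

            if (buffer_.size() - pos - header_len < msg_len)
                break;  // payload not complete yet

            const auto first = buffer_.begin() +
                static_cast<std::ptrdiff_t>(pos + header_len);
            on_message_(bytes(first,
                              first + static_cast<std::ptrdiff_t>(msg_len)));
            pos += header_len + msg_len;
        }

        // Drop everything consumed in this call with a single erase.
        buffer_.erase(buffer_.begin(),
                      buffer_.begin() + static_cast<std::ptrdiff_t>(pos));
    }

private:
    std::size_t max_msg_len_;
    handler on_message_;
    bytes buffer_;
};
```

Because the length is checked as soon as the header is complete, the
bytes left pending between calls stay below header_len + max_msg_len.
Feeding the same stream one byte at a time or all at once should
produce the same sequence of messages.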

We find bugs in exactly this logic all the darn time. Often the data
being received is untrusted and possibly malicious. Real-world protocol
implementations will commonly crash under fragmentation fuzzing,
sometimes resulting in exploitable security holes.

In a sense, this is the general refactoring problem of
'incrementalizing' a parsing function by moving all its state from stack
variables into a longer-lived context object.
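
A small sketch of what that refactoring can look like: the loop counters
and the "bytes still expected" value that a blocking parser would keep on
the stack become members, so parsing can stop and resume at any byte
boundary. The 2-byte big-endian length header and the names
incremental_parser, on_payload_chunk and on_message_end are assumptions
made for illustration only. Payload bytes are handed on as they arrive
(issue 4) and nothing is buffered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>

// Assumed wire format for this sketch: a 2-byte big-endian length
// followed by that many payload bytes.
class incremental_parser
{
public:
    using chunk_handler =
        std::function<void(const std::uint8_t*, std::size_t)>;
    using end_handler = std::function<void()>;

    incremental_parser(chunk_handler on_payload_chunk,
                       end_handler on_message_end)
        : on_payload_chunk_(std::move(on_payload_chunk)),
          on_message_end_(std::move(on_message_end))
    {
        assert(on_payload_chunk_ && on_message_end_);
    }

    // Consume n bytes; may be called with any split of the stream.
    void feed(const std::uint8_t* p, std::size_t n)
    {
        while (n > 0)
        {
            if (header_seen_ < 2)
            {
                // Still reading the length, one byte at a time.
                remaining_ = (remaining_ << 8) | *p;
                ++p; --n; ++header_seen_;
                if (header_seen_ == 2 && remaining_ == 0)
                    finish();  // zero-length message
            }
            else
            {
                // Pass on as much payload as this call contains.
                const std::size_t take = std::min(n, remaining_);
                on_payload_chunk_(p, take);
                p += take; n -= take; remaining_ -= take;
                if (remaining_ == 0)
                    finish();
            }
        }
    }

private:
    void finish()
    {
        on_message_end_();
        header_seen_ = 0;
        remaining_ = 0;
    }

    chunk_handler on_payload_chunk_;
    end_handler on_message_end_;
    std::size_t header_seen_ = 0;  // header bytes read so far (0..2)
    std::size_t remaining_ = 0;    // length so far, then payload left
};
```

The cost of this style is visible even at this size: every place where
a blocking parser would simply wait for more input turns into an
explicit state that has to be stored and tested on re-entry.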

We've seen it done successfully with coroutines, but that's not a
commonly accepted solution because, frankly, the native C/C++ runtimes
have not yet given coroutines the love (i.e., portability and
performance guarantees) they really deserve.

If someone figured out how to leverage generic techniques to handle just
the unidirectional message delimiting problem in a bulletproof way, I
think it would make a really great Boost library.

- Marsh

