From: Eric Hill (eric_at_[hidden])
Date: 2008-05-20 18:09:05


Greetings everyone,

I need to stream data over a very low-bandwidth connection (think
cell-phones and GPRS). The type of data will vary, but will typically
contain strings, numbers, and short groups of these basic types. The
application on both ends of the pipe will be written in C++, so I've
loosely decided on the boost::serialization library as it virtually
eliminates all of the code I would have needed to manually write.
(Awesome!)

I've made up a bunch of test archives, and I'd like to get some feedback
on a possible optimization, or at least specialization, of the code that
streams out STL collections.

The current serialization methodology for STL containers saves the size of
the container followed by each item inside the container. This code is
also used for std::string as it behaves like an STL container of
characters. The downside is that each string uses a minimum of 8
bytes (a 64-bit size field) plus the string payload.

Proposal: Write out a single byte that indicates the number of elements to
follow. If the number of elements is 255 or more, write out a single byte
0xFF, followed by the size_type holding the actual count. Reading follows
the same pattern in reverse: read a single byte; if the byte is 0xFF, read
a size_type, otherwise the byte is the count.
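
A minimal sketch of the scheme, written against plain iostreams rather
than the serialization library itself. The names write_count, read_count,
encoded_size and check_round_trip are hypothetical, and the sketch assumes
that the escape byte is 0xFF (the value the 255 threshold implies) and
that the full count is a fixed 8 bytes in host byte order:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>

// Escape byte meaning "the full-width count follows" (assumed to be 0xFF).
const unsigned char kEscape = 0xFF;

// Hypothetical helper, not part of Boost.Serialization.
// Counts 0..254 take one byte; larger counts take the escape byte plus
// a fixed 8-byte count in host byte order (an assumption of this sketch).
void write_count(std::ostream& os, std::uint64_t n) {
    if (n < kEscape) {
        os.put(static_cast<char>(static_cast<unsigned char>(n)));
    } else {
        os.put(static_cast<char>(kEscape));
        os.write(reinterpret_cast<const char*>(&n), sizeof n);
    }
}

// Mirror image of write_count: read one byte, and read the full-width
// count only when that byte is the escape value.
std::uint64_t read_count(std::istream& is) {
    const int first = is.get();
    if (!is) throw std::runtime_error("truncated count");
    if (first != kEscape) return static_cast<std::uint64_t>(first);
    std::uint64_t n = 0;
    is.read(reinterpret_cast<char*>(&n), sizeof n);
    if (!is) throw std::runtime_error("truncated count");
    return n;
}

// Number of bytes write_count emits for a given count.
std::size_t encoded_size(std::uint64_t n) {
    std::ostringstream os;
    write_count(os, n);
    return os.str().size();
}

// Encode a count into a buffer and check that it decodes unchanged.
void check_round_trip(std::uint64_t n) {
    std::stringstream buf;
    write_count(buf, n);
    assert(read_count(buf) == n);
}
```

With these assumptions, a three-character string costs 1 byte of overhead
instead of 8, while a container of 255 or more elements costs 9 bytes
instead of 8.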

Simple example from my problem domain: If I have a list of three-letter
bin locations in a warehouse and each bin contains a quantity of a
specific item, I will have the following data to send:

DER: 427
ALU: 582
COM: 821
TER: 991
FLO: 0
TER: 298
ALP: 332
PED: 773

Using the boost serialization framework, that data becomes 160 bytes (8
for the container size, 8+length for each string, and 8 for each integer).
Using my proposal, the data size drops to 97 bytes (1 for the container
size, 1+length for each string, and 8 for each integer), a reduction of
nearly 40%.
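
The arithmetic behind those two totals can be checked with a short
program. This is only a sketch of the byte accounting, not of the
library's actual archive layout: BinList, count_bytes, archive_bytes,
sample_bins and check_figures are hypothetical names, and it assumes
8-byte size fields and 8-byte integers, as in the figures above:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One entry per bin: three-letter location and quantity.
using BinList = std::vector<std::pair<std::string, long>>;

// Width assumed for size fields and integers in the figures above.
const std::size_t kWide = 8;

// Bytes used to store a count of n elements under each scheme.
std::size_t count_bytes(std::size_t n, bool proposed) {
    if (!proposed) return kWide;       // always a full-width count
    return n < 255 ? 1 : 1 + kWide;    // one byte, or escape + full count
}

// Total bytes: container count, then per bin a string count, the
// string payload, and one full-width integer.
std::size_t archive_bytes(const BinList& bins, bool proposed) {
    std::size_t total = count_bytes(bins.size(), proposed);
    for (const auto& bin : bins) {
        total += count_bytes(bin.first.size(), proposed);
        total += bin.first.size();
        total += kWide;
    }
    return total;
}

// The eight bins from the example.
BinList sample_bins() {
    return {{"DER", 427}, {"ALU", 582}, {"COM", 821}, {"TER", 991},
            {"FLO", 0},   {"TER", 298}, {"ALP", 332}, {"PED", 773}};
}

// Confirms the two totals quoted in the text.
void check_figures() {
    assert(archive_bytes(sample_bins(), false) == 160);  // 8 + 8*(8+3+8)
    assert(archive_bytes(sample_bins(), true) == 97);    // 1 + 8*(1+3+8)
}
```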

I theorize that many serialized strings and collections have fewer than 255
items or characters (especially in my problem domain) and that this
technique will save us many on-the-wire bytes over time.

A) Do you think this is a reasonable addition/modification to the
serialization library?

B) Is there any way to add this functionality to the serialization library
without breaking existing archives? I see a call to get_library_version
in the code, but I'm not sure what its purpose is. Anyone?

Thanks,
Eric

