From: Matthias Troyer (troyer_at_[hidden])
Date: 2002-12-16 04:52:45


Dear Serialization-Boosters,

In the past weeks I have been too busy at work to contribute much to the
serialization debate, but I managed to find some time now. It seems to
me that the discussion is drifting too far into semantic debates and I
would like to refocus by proposing to split the problem vertically and
discuss a "bottom up" approach to a serialization library, where I
start with the most basic serialization and add higher levels on top of
the lower ones.

1) "Definition of serialization": I want a very broad definition:
Serialization for me is the conversion of the content of an object into
a sequential stream. (I am not talking about C++ I/O streams here. It
does not matter whether this stream is text or binary, or holes punched
into a tape - this is an archive specific implementation detail).
Deserialization is the reverse process of converting a sequential
stream into object contents. While in many applications the process has
to be reversible to be useful, this is not needed for all cases (e.g.
output for debugging purposes or output to be read by another program).
I hope that we can agree so far.

In the following I will use the term serialization to mean both
serialization and deserialization but will focus the examples on
serialization to keep the text shorter.

2) "Serialization engine". Next I propose that we agree on
serialization of the basic data types: char, short, int, long, long
long, float, double, long double and their signed/unsigned variants
where appropriate. The abstract archive class should provide for the
serialization of these basic data types and also contain optimized
functions to serialize contiguous arrays of these types (e.g. a string,
or an array of basic data types). The concrete archive class provides
for the actual serialization of these types into the archive format
(could be text, native binary, XDR, or whatever). As I have become aware
from the discussion here, the main issue seems to be the optimized
functions for arrays of basic data types, which I view as essential;
otherwise I don't believe there is any controversy so far. Since the
implementation of
the higher level functionality of the archive classes in the following
will be built on top of this basic functionality and will write to the
stream only by writing these basic data types, I propose to separate
out the functionality of serializing basic data types into a
"serialization engine". That way, the format specificity (text vs.
native binary vs. XDR, ...) is encapsulated in the serialization
engine. If we agree on this, then we could start by defining an
interface for the serialization engine.
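
As a starting point for that discussion, here is a minimal sketch of what two engines sharing one interface might look like. The names text_engine, binary_engine, write and write_array are hypothetical and are not taken from any existing library. The point is only that the array function is a separate hook, so a binary engine can copy a contiguous block in one step while a text engine falls back to a loop.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical text engine: writes basic types as space-separated tokens.
class text_engine {
public:
    template <class T>
    void write(const T& value) {
        static_assert(std::is_arithmetic<T>::value, "basic types only");
        out_ << value << ' ';
    }

    // No shortcut is possible for text, so the array hook is a plain loop.
    template <class T>
    void write_array(const T* data, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) write(data[i]);
    }

    std::string str() const { return out_.str(); }

private:
    std::ostringstream out_;
};

// Hypothetical native binary engine: same interface, different format.
class binary_engine {
public:
    template <class T>
    void write(const T& value) { write_array(&value, 1); }

    // Optimized path: the whole contiguous array is appended as one block.
    template <class T>
    void write_array(const T* data, std::size_t count) {
        static_assert(std::is_arithmetic<T>::value, "basic types only");
        const char* bytes = reinterpret_cast<const char*>(data);
        buf_.insert(buf_.end(), bytes, bytes + count * sizeof(T));
    }

    const std::vector<char>& bytes() const { return buf_; }

private:
    std::vector<char> buf_;
};
```

Everything format specific lives in these two classes; code built on top only ever calls write and write_array.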

3) "Archive preamble": Next, there is the question of a preamble of the
archive. There we need flexibility to enable compatibility with formats
given by other applications and compatibility with legacy formats. I do
not view the standardization of a preamble a task for Boost. Rather,
the preamble should be archive-format specific and Boost just provides
the framework for many archives (as well as several useful standard
archive formats). Again, I believe that we have no disagreements here,
or am I mistaken?

4) "Serialization of UDTs (user-defined types)" is the next level up.
Just overloading operator<< for UDTs (as I did in my 8-year-old
serialization library) will not allow the advanced functionality
provided by Robert's library, so I propose to follow Robert's idea of a
serialization<T> template or to implement similar functionality using
free functions.
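
To make the idea concrete, here is a hedged sketch of a non-intrusive serialization<T> template. Only the name serialization<T> comes from the discussion; the static save member, the text_oarchive stand-in, the point type and the free save function are assumptions made for illustration and are not the interface of Robert's library.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for an archive: writes space-separated text.
struct text_oarchive {
    std::ostringstream out;
    template <class T>
    void write_basic(const T& v) { out << v << ' '; }
};

// Primary template is declared but not defined, so a UDT without a
// specialization fails at compile time instead of being written wrongly.
template <class T>
struct serialization;

// An example user-defined type that knows nothing about archives.
struct point {
    int x;
    int y;
};

// Non-intrusive specialization: describes how a point is written.
template <>
struct serialization<point> {
    static void save(text_oarchive& ar, const point& p) {
        ar.write_basic(p.x);
        ar.write_basic(p.y);
    }
};

// Free-function entry point that dispatches through the template.
template <class T>
void save(text_oarchive& ar, const T& value) {
    serialization<T>::save(ar, value);
}
```

Because the dispatch goes through a class template rather than operator<<, further per-type information (a version number, for instance) can later be attached to the same specialization.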

5) "Versioning": The next level for me is versioning support. We have
discussed versioning support on a per-archive and a per-class level. I
would like to see both variants supported. Per-class versioning is more
flexible, but has two disadvantages: i) it introduces overhead and ii)
it writes extra information into the stream, which might make the
output incompatible with some applications.
Regarding i: we have to write the version number for each UDT
encountered, but want to write it only once per UDT. We thus have to
keep track of which UDTs have been serialized so far, and whenever a
new UDT is encountered, its version number must be written to the
archive. This introduces overhead, especially if many small objects
have to be serialized.
I see a two-pronged approach as the best solution:
a) per-archive versioning, per-class versioning and no versioning
should all be supported for compatibility with other formats (issue ii
above)
b) if per-class versioning is used, it should be possible to turn it
off for some classes by a traits class - this will get rid of the
overhead (issue i above) when versioning is turned off for a UDT.
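
A rough sketch of point b), under the assumption of a hypothetical version_traits<T> class and a hypothetical versioned_oarchive (neither name exists in any library discussed here): the archive writes a version number the first time it meets a UDT, and skips both the bookkeeping and the output for types whose trait disables versioning.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <typeindex>
#include <typeinfo>

// Hypothetical trait: by default a UDT is versioned, at version 0.
template <class T>
struct version_traits {
    static constexpr bool enabled = true;
    static constexpr unsigned value = 0;
};

struct settings { int level; };    // rarely written, wants versioning
struct particle { double mass; };  // many small objects, opts out

template <>
struct version_traits<settings> {
    static constexpr bool enabled = true;
    static constexpr unsigned value = 2;
};

template <>
struct version_traits<particle> {
    static constexpr bool enabled = false;
    static constexpr unsigned value = 0;
};

class versioned_oarchive {
public:
    // Called before the members of a T are written.
    template <class T>
    void begin_object() {
        // Compile-time constant: for opted-out types nothing below runs.
        if (!version_traits<T>::enabled) return;
        // insert() reports whether this UDT is seen for the first time.
        if (seen_.insert(std::type_index(typeid(T))).second) {
            unsigned v = version_traits<T>::value;
            out_ << 'v' << v << ' ';
        }
    }

    std::size_t tracked_types() const { return seen_.size(); }
    std::string str() const { return out_.str(); }

private:
    std::set<std::type_index> seen_;  // UDTs whose version was written
    std::ostringstream out_;
};
```

The set lookup is the per-object overhead described under i); for particle the trait removes it completely, and nothing extra appears in the stream.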

6) "Advanced functionality": Robert's serialization library includes
further functionality, such as the serialization of pointers and of
polymorphic types. Here I want to focus on serialization of pointers. I
have not checked the implementation of Robert's library in detail, and
thus please correct me if I view this wrongly. Serialization of
pointers requires the conversion of a pointer to an integer. When
serializing objects, the archive thus has to keep track of the
addresses of objects, in order to later convert pointers into numbers.
This again introduces overhead. Robert addresses this partially in his
library by showing how to bypass this system for a UDT. His approach
however requires that if I want to bypass the pointer serialization
mechanism for a type T, then I have to re-implement serialization of
all standard containers of type T, such as std::vector<T>,
std::list<T>, std::stack<T>, ... for my type T. My proposal, which I
have mentioned before, is to add another traits type that specifies
whether the pointer serialization scheme can be bypassed for a type T
(like versioning above) and a faster, optimized serialization used
instead.
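
The following sketch illustrates the proposal under stated assumptions: tracking_traits<T>, tracking_oarchive, payload and the two example types are all hypothetical, and a real library would serialize an unknown pointee instead of asserting. What it shows is that one generic container function serves both tracked and untracked element types, so no container needs to be re-implemented to bypass the mechanism.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical trait: by default object addresses are tracked.
template <class T>
struct tracking_traits { static constexpr bool enabled = true; };

struct node { int id; };       // may be pointed to, stays tracked
struct sample { int value; };  // never pointed to, opts out

template <>
struct tracking_traits<sample> { static constexpr bool enabled = false; };

// Stand-ins for "write the members of the object".
inline int payload(const node& n) { return n.id; }
inline int payload(const sample& s) { return s.value; }

class tracking_oarchive {
public:
    template <class T>
    void save(const T& obj) {
        if (tracking_traits<T>::enabled) {
            // Remember the address under the next sequential integer id.
            const void* addr = &obj;
            ids_.insert(std::make_pair(addr, ids_.size()));
        }
        out_ << payload(obj) << ' ';
    }

    // One generic container function for every element type.
    template <class T>
    void save_vector(const std::vector<T>& v) {
        for (std::size_t i = 0; i < v.size(); ++i) save(v[i]);
    }

    // A pointer is written as the integer id of its pointee.
    template <class T>
    void save_pointer(const T* p) {
        std::map<const void*, std::size_t>::const_iterator it = ids_.find(p);
        assert(it != ids_.end());  // sketch only: pointee must be saved first
        out_ << '#' << it->second << ' ';
    }

    std::size_t tracked() const { return ids_.size(); }
    std::string str() const { return out_.str(); }

private:
    std::map<const void*, std::size_t> ids_;  // address -> integer id
    std::ostringstream out_;
};
```

Saving a std::vector<sample> through save_vector never touches the address map, while a std::vector<node> saved through the same function remains available as a pointer target.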

Thus to summarize, I propose to split the serialization archive into a
serialization engine, doing the serialization of the basic types, and
an archive class which takes the engine as a template parameter. The
archive class can contain add-ons such as versioning, pointer
serialization, polymorphic types, etc. It is important for me to have:

a) flexible preambles to the archive
b) versioning support: never, per-class, or per-object, with the
latter selectively specified by traits
c) it should be possible to turn off advanced functionality, such as
pointer serialization, selectively by traits classes, which should
result in a no-overhead solution.
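
As a sketch of the overall split, assuming hypothetical names throughout (oarchive, text_engine, no_preamble, tagged_preamble and the "demo-archive" tag are invented for this example): the archive takes the engine as a template parameter and writes to the stream only through it, and the preamble is a second, replaceable parameter.

```cpp
#include <sstream>
#include <string>

// Hypothetical engine: the only place that knows the output format.
struct text_engine {
    std::ostringstream out;
    template <class T>
    void write(const T& v) { out << v << ' '; }
};

// Preamble policies: format specific, chosen per archive.
struct no_preamble {
    template <class Engine>
    static void write(Engine&) {}
};

struct tagged_preamble {
    template <class Engine>
    static void write(Engine& e) {
        e.write(std::string("demo-archive"));  // made-up format tag
        e.write(1);                            // made-up archive version
    }
};

// The archive is built on top of the engine; add-ons such as versioning
// or pointer tracking would be layered here, not in the engine.
template <class Engine, class Preamble = no_preamble>
class oarchive {
public:
    oarchive() { Preamble::write(engine_); }

    template <class T>
    oarchive& operator<<(const T& v) {
        engine_.write(v);
        return *this;
    }

    const Engine& engine() const { return engine_; }

private:
    Engine engine_;
};
```

With no_preamble the stream contains nothing but the data, which is what compatibility with formats defined by other applications requires.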

I believe that Robert's library has most of what it takes to get to
such a solution and am willing to help with implementing.

Matthias

