From: Dan Notestein (dan_at_[hidden])
Date: 2006-08-17 17:18:33


The current implementation of serialization has some limitations in how
it handles contained data, and I would like to remove them. By contained data,
I mean data that exists inside the allocated boundary of a containing
object that is also being serialized. For example, in the code:
 
class TMyClass { int x; int* y; };
 
x is contained data of TMyClass. The data pointed to by y is not
necessarily contained data (and in general, will not be).
Similarly, all the elements in an array of objects are contained data.
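
To make the definition concrete, here is a minimal sketch of a containment
test on raw address ranges. IsContained is a hypothetical helper, not part
of Boost.Serialization, and TMyClass is written as a struct here only so
that the members are accessible from outside.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when the byte range [inner, inner + innerSize)
// lies entirely inside [outer, outer + outerSize).
inline bool IsContained(const void* inner, std::size_t innerSize,
                        const void* outer, std::size_t outerSize)
{
    // Compare as integers so the test is also meaningful for pointers
    // into unrelated objects.
    const std::uintptr_t innerBegin = reinterpret_cast<std::uintptr_t>(inner);
    const std::uintptr_t outerBegin = reinterpret_cast<std::uintptr_t>(outer);
    return innerBegin >= outerBegin &&
           innerBegin + innerSize <= outerBegin + outerSize;
}

// Same layout as the example class above, with public members.
struct TMyClass { int x; int* y; };
```

With this test, &obj.x is contained in obj, an int that obj.y points to
elsewhere is not, and &arr[2] is contained in the array arr.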
 
One problem in the current implementation is that pointers to contained
data can only be serialized after the contained data is serialized.
Also, special code needs to be written to handle pointers that point to
elements in arrays (special case code already exists for handling
pointers to elements in STL vectors, but this code could also be made
more efficient).
 
Here's a rough proposal for changing serialization to handle these
issues. It's not a complete proposal, just a starting point for
discussion. The basic idea is to serialize in two passes. In the first
pass, we "walk" the objects using normal serialization order from the
root and determine which objects are contained by other objects. In the
second pass, we serialize as in the current implementation, except that
instead of always serializing an object the first time we encounter a
pointer to it, we only serialize such pointers if the object is not a
contained object. Below is a more detailed description of the
algorithm:
 
Pass 1:
-----------------
1) For each object to be serialized via a pointer, check whether it is
already in the ObjectManager set. If not, add the object's TObjectInfo to
the ObjectManager.
 
class TObjectInfo
{
  int ObjectId; //consecutively assigned when object is first added
  void* Object; //beginning of object boundary
  void* ObjectEnd; //end of object boundary
  TClassInfo* ClassInfo;
 
  int OwnerId; //id of containing object, if any
  unsigned int OwnerOffset; //byte offset of this object inside its container
};
 
class TClassInfo
{
  int ClassId; //consecutively assigned when first object of class is added
  TSerializationFunctionPtr SerializeFunction;
};
 
Data members of oarchive (similar members already exist with somewhat
different implementations):
 
set<TObjectInfo> ObjectManager;
set<TClassInfo> ClassManager;
vector<TObjectInfo*> SortedObjectInfo;
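
Step 1 could be sketched roughly as follows. This is only an illustration
under some assumptions of mine: TObjectRegistry is a hypothetical name, a
std::map keyed on (address, size) stands in for set<TObjectInfo> so that an
object and its first member get separate entries, OwnerId == -1 means "not
contained", and ClassInfo is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

struct TObjectInfo
{
    int ObjectId;             // consecutively assigned on first registration
    const void* Object;       // beginning of object boundary
    const void* ObjectEnd;    // end of object boundary (one past the last byte)
    int OwnerId;              // -1 until pass 1 finds a container (assumption)
    std::size_t OwnerOffset;
};

// Hypothetical stand-in for the ObjectManager member of the oarchive.
class TObjectRegistry
{
public:
    // Returns the id of the object, registering it on first sight.
    int Register(const void* object, std::size_t size)
    {
        assert(object != nullptr);
        const TKey key(reinterpret_cast<std::uintptr_t>(object), size);
        const std::map<TKey, TObjectInfo>::const_iterator found =
            objects_.find(key);
        if (found != objects_.end())
            return found->second.ObjectId;   // already known

        const int id = static_cast<int>(objects_.size());
        const char* begin = static_cast<const char*>(object);
        const TObjectInfo info = {id, begin, begin + size, -1, 0};
        objects_.insert(std::make_pair(key, info));
        return id;
    }

    std::size_t Size() const { return objects_.size(); }

private:
    // Keyed on (address, size): a struct and its first member share an
    // address but must be tracked as two different objects.
    typedef std::pair<std::uintptr_t, std::size_t> TKey;
    std::map<TKey, TObjectInfo> objects_;
};
```

Registering the same pointer twice returns the same id, so ids stay
consecutive in first-encounter order.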
 
2) Create a vector SortedObjectInfo of pointers to the TObjectInfo entries
in the ObjectManager, sorted by object address. Then mark every object in
this vector that is contained in the range of another object by setting
its OwnerId (id of the containing object) and OwnerOffset (byte offset of
the contained object inside its container). If two objects have the
same address, the object with the larger size "contains" the smaller
object. If multiple objects contain an object, the largest container is
the owner. Generate a warning and/or exception if an object is partially
contained by some object but not fully contained by any object, and treat
it as an uncontained object (overlapping data will be duplicated).
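
Step 2 might look roughly like the sketch below. It is deliberately simple
(a sort followed by a quadratic scan) rather than efficient. The
assumptions are mine: Begin/End are integer stand-ins for Object/ObjectEnd,
OwnerId == -1 means "not contained", PartialOverlap is a flag the caller
would turn into the warning or exception, and two objects with identical
ranges are not treated as containing each other.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TObjectInfo
{
    TObjectInfo(int id, std::uintptr_t begin, std::uintptr_t end)
        : ObjectId(id), Begin(begin), End(end),
          OwnerId(-1), OwnerOffset(0), PartialOverlap(false)
    {
        assert(begin <= end);
    }

    int ObjectId;
    std::uintptr_t Begin;     // stands in for Object
    std::uintptr_t End;       // stands in for ObjectEnd (one past last byte)
    int OwnerId;              // -1: not contained (convention of this sketch)
    std::size_t OwnerOffset;
    bool PartialOverlap;      // overlapped by some object, owned by none
};

inline void ResolveOwners(std::vector<TObjectInfo>& objects)
{
    // Sort by start address; at equal addresses the larger object comes
    // first, so every possible container precedes what it contains.
    std::sort(objects.begin(), objects.end(),
              [](const TObjectInfo& a, const TObjectInfo& b) {
                  if (a.Begin != b.Begin) return a.Begin < b.Begin;
                  return a.End > b.End;
              });

    for (std::size_t i = 0; i < objects.size(); ++i) {
        TObjectInfo& inner = objects[i];
        const std::size_t innerSize =
            static_cast<std::size_t>(inner.End - inner.Begin);
        std::size_t bestSize = 0;
        for (std::size_t j = 0; j < i; ++j) {
            TObjectInfo& outer = objects[j];  // outer.Begin <= inner.Begin
            if (outer.End <= inner.Begin) continue;   // disjoint ranges
            const std::size_t outerSize =
                static_cast<std::size_t>(outer.End - outer.Begin);
            if (outer.End >= inner.End) {
                // Fully contained: keep the largest strictly larger container.
                if (outerSize > innerSize && outerSize > bestSize) {
                    bestSize = outerSize;
                    inner.OwnerId = outer.ObjectId;
                    inner.OwnerOffset =
                        static_cast<std::size_t>(inner.Begin - outer.Begin);
                }
            } else {
                // Ranges overlap but neither contains the other.
                inner.PartialOverlap = true;
                outer.PartialOverlap = true;
            }
        }
    }

    // An object that has an owner is not reported as partially contained.
    for (std::size_t i = 0; i < objects.size(); ++i)
        if (objects[i].OwnerId != -1) objects[i].PartialOverlap = false;
}
```

For example, with object 0 at [100,140), object 1 at [108,112) and object
2 at [100,120), both 1 and 2 end up owned by object 0, at offsets 8 and 0
respectively.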
 

Pass 2:
-----------------
Starting from the root object again, serialize each non-contained object
mostly as in the existing serialization implementation. That is, the first
time a pointer to a non-contained object is encountered, write out the
actual data, and on subsequent encounters of that pointer, write out the
object id. For pointers to contained objects, always write the OwnerId
and OwnerOffset. We need some way to differentiate between ObjectIds and
OwnerIds when deserializing. I suppose we should be able to use the same
mechanism already employed to differentiate between object pointers and
actual object data during deserialization.
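
The decision made for each pointer in pass 2 could be sketched like this.
The explicit TRecordKind tag is just one possible way to tell ObjectIds
from OwnerIds, not necessarily the mechanism the library would use, and
TPointerRecord, TPointerTarget and TPass2Writer are hypothetical names.
Writing the object's actual data is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// One possible tag for telling the three cases apart on load (assumption).
enum class TRecordKind { ObjectData, ObjectRef, OwnerRef };

struct TPointerRecord
{
    TRecordKind Kind;
    int Id;               // ObjectId, or OwnerId when Kind == OwnerRef
    std::size_t Offset;   // OwnerOffset, only meaningful for OwnerRef
};

// What pass 1 determined about the object a pointer refers to.
struct TPointerTarget
{
    int ObjectId;
    int OwnerId;              // -1 when the object is not contained
    std::size_t OwnerOffset;
};

class TPass2Writer
{
public:
    // Decides what to emit for one pointer encountered during pass 2.
    TPointerRecord WritePointer(const TPointerTarget& target)
    {
        assert(target.ObjectId >= 0);
        if (target.OwnerId >= 0) {
            // Contained object: always refer to it through its owner.
            return {TRecordKind::OwnerRef, target.OwnerId, target.OwnerOffset};
        }
        if (written_.insert(target.ObjectId).second) {
            // First encounter: the object's data follows this record.
            return {TRecordKind::ObjectData, target.ObjectId, 0};
        }
        // Later encounters: the object id alone is enough.
        return {TRecordKind::ObjectRef, target.ObjectId, 0};
    }

private:
    std::set<int> written_;   // ids of non-contained objects already written
};
```

A pointer to a free-standing object yields ObjectData the first time and
ObjectRef afterwards; a pointer to a contained object always yields
OwnerRef with the owner's id and the byte offset.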
 
Note: The ObjectId could potentially be eliminated and the original
object ptr used as an id instead, if desired, but the reproducibility
of using an Id seems better for testing and it also makes for faster
deserialization, since we can use vector lookup instead of set
lookup during deserialization to map between the id/old pointer and the
new location for the data.
 

Deserialization:
-----------------
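class TObjectDependents
{
  void* Object; //new location of the object; null until it has been loaded
  vector<void*> ObjectPointersToFixup; //addresses of pointers waiting for Object
};
 
Data member in iarchive:
std::vector<TObjectDependents> ObjectDependents;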
 
During deserialization of each tracked object, store its newly allocated
location in the Object field of its entry in the ObjectDependents vector
(indexed by the object's id) and fix up any addresses in
ObjectPointersToFixup.
 
Whenever we encounter an ObjectId during deserialization of an object,
check ObjectDependents[ptrId].Object. If not null, use this address for
the pointer fixup. If Object is null (because the object pointed to has
not been loaded yet), save the address of the pointer in that object's
list of pointers to be patched,
ObjectDependents[ptrId].ObjectPointersToFixup.
 
Similarly, whenever we encounter an OwnerId, check the ObjectDependents
vector, but in this case we need to add the OwnerOffset as part of the
fixup process.
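
The fixup bookkeeping described above could be sketched as follows. Two
things here are my assumptions rather than part of the proposal: each
pending entry also stores the byte offset (zero for a plain ObjectId), so
that OwnerId references can be patched later, and the TFixupTable wrapper
and its member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A pointer waiting for its target, plus the offset to add (assumption:
// the offset is stored per entry; it is 0 for a plain ObjectId).
struct TPendingFixup
{
    void** Pointer;
    std::size_t Offset;
};

struct TObjectDependents
{
    TObjectDependents() : Object(nullptr) {}
    void* Object;                                     // null until loaded
    std::vector<TPendingFixup> ObjectPointersToFixup;
};

// Hypothetical wrapper around the ObjectDependents vector of the iarchive.
class TFixupTable
{
public:
    explicit TFixupTable(std::size_t objectCount) : table_(objectCount) {}

    // Called once the object with this id has been loaded at `address`.
    void ObjectLoaded(int id, void* address)
    {
        TObjectDependents& entry = At(id);
        entry.Object = address;
        // Patch every pointer that was read before the object existed.
        for (std::size_t i = 0; i < entry.ObjectPointersToFixup.size(); ++i) {
            const TPendingFixup& fix = entry.ObjectPointersToFixup[i];
            *fix.Pointer = static_cast<char*>(address) + fix.Offset;
        }
        entry.ObjectPointersToFixup.clear();
    }

    // Called when an ObjectId (offset 0) or OwnerId/OwnerOffset is read.
    void ResolvePointer(int id, std::size_t offset, void** pointer)
    {
        TObjectDependents& entry = At(id);
        if (entry.Object != nullptr) {
            *pointer = static_cast<char*>(entry.Object) + offset;
        } else {
            const TPendingFixup fix = {pointer, offset};
            entry.ObjectPointersToFixup.push_back(fix);
        }
    }

private:
    TObjectDependents& At(int id)
    {
        assert(id >= 0 && static_cast<std::size_t>(id) < table_.size());
        return table_[static_cast<std::size_t>(id)];
    }

    std::vector<TObjectDependents> table_;   // indexed by object id
};
```

Note that this only works if the saved pointer addresses stay valid until
the target is loaded, so pointers living in storage that may be moved or
reallocated during loading would need extra care.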

