From: Ares Lagae (ares.lagae_at_[hidden])
Date: 2003-01-12 08:21:49


On the Yahoo groups, I followed some discussions about a possible Boost serialization library.
I followed them with great interest because I'm also working on a serialization library.
For what it's worth (I'm not a Boost developer), here are some thoughts about serialization:

1) Serialization is based on reflection (introspection, MOP, ... whatever), and to implement serialization there has to be a reflection library first. Try to decouple serialization from reflection as much as possible. Because serialization needs reflection anyway and there are other uses for reflection, it is better to stay general and decouple the two (e.g. one could use the reflection library without the serialization library, but not the other way around).

2) Reflection requires knowledge of the properties of classes: the base classes, the data members with their types and names, and the member functions with their types and names.
For example, one could implement the classes Class<class>, BaseClass<class, base>, DataMember<class, type>(data member pointer, name) and MemberFunction<class, return, args>(member function pointer, name).
For each class C a method describe() could be implemented, making a Class<C> object and adding to that object BaseClass<C, X> for all the base classes, DataMember<C, type>(data member pointer, name) for all the data members and MemberFunction<C, return, args>(member function pointer, name) for all the member functions.
Additionally, DataMember<class, type>(data member pointer, name) should provide type get() and set(type value), and MemberFunction<C, return, args>(member function pointer, name) should provide return invoke(instance, args).
The Class<C> object should support methods like getDataMember(name) and getMemberFunction(name).
This describe system relies only on things the compiler knows at compile time, and therefore one could imagine a compiler that generates the describe method automatically. There is no need to add additional (non-static) class members, because reflection is inherently about classes, and not about class methods.
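
To make the describe() idea concrete, here is a minimal sketch covering data members only (base classes and member functions are left out). Everything in it is an assumption for illustration: the MemberBase helper, the add() method, the template argument on getDataMember, and the Point example class are hypothetical, and get/set take the instance as an explicit argument.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical type-erased base so members of different types fit in one map.
template <class C>
struct MemberBase {
    virtual ~MemberBase() = default;
};

// Describes one data member of class C with type T.
template <class C, class T>
class DataMember : public MemberBase<C> {
public:
    DataMember(T C::*ptr, std::string name) : ptr_(ptr), name_(std::move(name)) {}
    const std::string& name() const { return name_; }
    T get(const C& obj) const { return obj.*ptr_; }
    void set(C& obj, T value) const { obj.*ptr_ = std::move(value); }

private:
    T C::*ptr_;
    std::string name_;
};

// Collects the description of class C; filled in by C::describe().
template <class C>
class Class {
public:
    explicit Class(std::string name) : name_(std::move(name)) {}
    const std::string& name() const { return name_; }

    // Registers a data member under the given name.
    template <class T>
    Class& add(T C::*ptr, const std::string& name) {
        members_[name] = std::make_shared<DataMember<C, T>>(ptr, name);
        return *this;
    }

    // Looks a data member up by name; the caller states the expected type.
    template <class T>
    const DataMember<C, T>& getDataMember(const std::string& name) const {
        auto it = members_.find(name);
        if (it == members_.end())
            throw std::out_of_range("no such data member: " + name);
        auto* m = dynamic_cast<const DataMember<C, T>*>(it->second.get());
        if (!m)
            throw std::invalid_argument("wrong type for data member: " + name);
        return *m;
    }

private:
    std::string name_;
    std::map<std::string, std::shared_ptr<MemberBase<C>>> members_;
};

// Example class: describe() is static, so no per-instance data is added.
struct Point {
    int x = 0;
    int y = 0;

    static Class<Point> describe() {
        Class<Point> c("Point");
        c.add(&Point::x, "x").add(&Point::y, "y");
        return c;
    }
};
```

With this, Point::describe().getDataMember<int>("x").set(p, 7) writes to p.x through the description alone, which is the kind of access a serialization layer would build on.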

3) Reflection is the difficult part, serialization is the easy part. Given a reflection system, all the serialization method must do is query the list of data members. If the type of a data member is primitive, the data member is serialized directly; if it is not a primitive type, we query the data members of that type in turn, repeating the process recursively.
Clearly some care must be taken for pointers. When serializing a pointer, the system should create a handle. The handle is constructed with an ID (each handle has a unique ID), the address held by the pointer, and the data the pointer points to. The handle is then serialized and remembered by the serialization subsystem. The next time we encounter a pointer, we check whether we already have a handle for it, and if we do, we only put the handle ID in the serialization stream. This ensures the data can be deserialized properly, and without overhead.
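
The handle scheme could be sketched roughly as follows. This is only an illustration under assumptions: the Node and Archive types, the writePointer name, and the "obj"/"ref"/"null" text tokens are all made up here, and the members of Node are written by hand instead of going through a reflection layer.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical example type: a linked node whose pointer may alias or cycle.
struct Node {
    int value;
    Node* next;
};

// Hypothetical archive that remembers a handle ID per pointer address.
class Archive {
public:
    void writePointer(const Node* p) {
        if (!p) {
            out_ << "null ";
            return;
        }
        auto it = handles_.find(p);
        if (it != handles_.end()) {
            // Seen before: only the handle ID goes into the stream.
            out_ << "ref " << it->second << ' ';
            return;
        }
        // First encounter: assign a fresh ID and remember it before
        // recursing, so that cyclic structures terminate.
        const std::size_t id = handles_.size() + 1;
        handles_[p] = id;
        out_ << "obj " << id << ' ' << p->value << ' ';
        writePointer(p->next);
    }

    std::string str() const { return out_.str(); }

private:
    std::map<const void*, std::size_t> handles_;  // address -> handle ID
    std::ostringstream out_;
};
```

For two nodes that point at each other, the second visit to the first node produces a "ref" entry instead of a second copy of its data, so a reader can restore the shared structure from the IDs.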

4) There are some pitfalls involved with serialization:

- Some data members cannot be serialized in an easy way (e.g. pointers, because they only have meaning on the local machine), and some data members cannot be serialized at all, for example sockets or open files. These are so-called transient data members. Due to restrictions of C++, at first sight references cannot be serialized either. A scheme to solve this is quite difficult. However, one knows that references are typically initialized in a constructor, and that transient members are typically created from non-transient data (e.g. an attribute char * fileName would be non-transient, but the file pointer for it can be created if the value of fileName is known). So we could imagine serializable classes having a method onSerialized() to initialize transient data members. I would have to see how Java handles transient data members.

- How do deserialization and constructors go together? I don't see "the approach" to handle this, but this is what I think about it: when an object is deserialized, one allocates memory for the object without invoking a constructor (we do not want to call e.g. the default constructor and initialize all attributes with default values only to overwrite them next), or with a constructor invocation containing no code, then sets the data members from the serialization stream, and then calls a constructor (knowing that the data members have been deserialized), e.g. to initialize reference data members and transients.

- In this context it would be good practice not to pass transient data across classes. E.g., instead of passing socket handles, one could make a Socket wrapper class and pass that instead. The Socket class will know how to initialize its transient data (for example, create a socket from a deserialized hostname data member).
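
A rough sketch of such a wrapper, with the onSerialized() hook rebuilding the transient part after the non-transient part has been restored. The save()/load() functions and the openConnection() stand-in are hypothetical; a real class would call the platform's socket API there, and the stream format is reduced to a plain string.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Stand-in for a real socket API (hypothetical): returns a fake handle.
inline int openConnection(const std::string& host) {
    return host.empty() ? -1 : static_cast<int>(host.size());
}

class Socket {
public:
    Socket() = default;  // does nothing beyond the default member values
    explicit Socket(std::string host) : hostname_(std::move(host)) {
        onSerialized();
    }

    // Only the non-transient state goes to the stream.
    std::string save() const { return hostname_; }

    // Restore the non-transient state first, then rebuild the transient part.
    static Socket load(const std::string& data) {
        Socket s;
        s.hostname_ = data;
        s.onSerialized();
        return s;
    }

    const std::string& hostname() const { return hostname_; }
    int handle() const { return handle_; }

private:
    // Hook called once the non-transient members are known.
    void onSerialized() { handle_ = openConnection(hostname_); }

    std::string hostname_;  // non-transient
    int handle_ = -1;       // transient: only meaningful on this machine
};
```

Code that holds a Socket never sees the raw handle in the serialized form; it is recreated from the hostname on whichever machine calls load().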

- The ultimate goal of serialization is not writing classes to disk (this is one of the goals) but storing classes in a generic way. When one serializes a class on a little-endian machine and sends it over the network to a big-endian machine, it should be deserialized properly. This is one of the most important properties of serialization and must be addressed. However, when one knows the class will only be serialized to and from the local disk on the same machine, one wants to avoid the overhead of platform-independent writes. The serialization subsystem should support different external formats. In my opinion, these are essential:
* local binary format, only for fast serialization, e.g. to local disk; this format should not be used to e.g. transmit data to other hosts
* local text format, same notes as the previous one, but it lets the user easily edit the data; text streams using the local locale could be used
* global binary format, XDR comes to mind
* global text format (one cannot use regular text streams here), XML comes to mind
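
As an illustration of what a platform-independent ("global binary") write involves, here is a small sketch that stores a 32-bit integer in big-endian byte order, the order XDR uses for integers, regardless of the host's endianness. The function names writeU32 and readU32 are made up for this example, and only this one type is handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends v to out, most significant byte first, independent of the host.
inline void writeU32(std::vector<std::uint8_t>& out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<std::uint8_t>((v >> shift) & 0xFFu));
}

// Reads four bytes starting at pos and reassembles the value.
// in.at() throws std::out_of_range if the buffer is too short.
inline std::uint32_t readU32(const std::vector<std::uint8_t>& in,
                             std::size_t pos) {
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < 4; ++i)
        v = (v << 8) | in.at(pos + i);
    return v;
}
```

The shifts work on values, not on memory layout, so the same bytes come out on little-endian and big-endian hosts. A local binary format could instead copy the in-memory representation directly and skip this per-byte work.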

- Serialization typically deals with only one instance of an object. This object, and all objects connected to it, are serialized (flattened). This serialization process is atomic. The object structure (e.g. pointers in the object) cannot (must not, when using e.g. threads) change during serialization. In the same way, different serialization calls are not atomic with respect to each other, and therefore must be completely independent. Consecutive serializations of the same instance of an object to the same archive are and must be totally independent. For example, suppose object A contains a pointer B *BPtr, and we serialize A. (*BPtr) will be serialized as well, in the same atomic "serialization unit". If we next serialize A again, (*BPtr) will also be serialized again. Questions like "is the object pointed to by A::BPtr serialized again the second time I serialize A? Also when it points to the same object?" are irrelevant. The answer is "yes" twice, because these are two different serialization units and not one atomic serialization unit, and therefore the system has no control over what happens in between.

just my 2 cents,
Ares Lagae

