Boost logo

Boost :

From: JOAQUIN LOPEZ MU?Z (joaquin_at_[hidden])
Date: 2007-05-05 09:13:51


----- Mensaje original -----
De: Peter Dimov <pdimov_at_[hidden]>
Fecha: Sábado, Mayo 5, 2007 12:17 pm
Asunto: Re: [boost] Serialization support,Was: [BoostCon07][testing]
        Session reminder.
Para: boost_at_[hidden]

> Sohail Somani wrote:
> >> -----Original Message-----
> >> From: boost-bounces_at_[hidden]
> >> [mailto:boost-bounces_at_[hidden]] On Behalf Of Peter Dimov
> >
> >> If your serialization methods
> >>
> >> - allow creation of objects whose invariants do not hold, or
> >> - expose implementation details in the external representation,
> >
> > What do you mean by the second line?
>
> Consider for example a set<int>. The proper way to serialize it is
> as a sequence of values. This is what the user sees, and this
external
> format is robust against changes in the implementation. If you
> serialize it as a tree of Node classes, this would make it quite
> hard to switch to a skip list representation later.

I'd like to join this intrusive vs. non-intrusive serialization
discussion from my experience in providing serialization support
for Boost.MultiIndex. As I see things, there are two related but
different ways in which serialization support can be said to be
intrusive:

  a) data intrusive: the stuff serialized reflects the internal
  structure of the class. The set<int> example proposed by Peter
  above illustrates a case of (unwise) data intrusive serialization.
  b) interface intrusive: the serialization algorithms cannot be
  implemented by using the class public interface alone. For
  instance, up to Boost 1.33 (if my memory serves me well),
  serialization of shared_ptrs was provided in an interface
  intrusive way because no non-intrusive approach was found --this
  has fortunately been corrected now.

Usually, data intrusive serialization implies interface intrusive
serialization, but *not* the other way around: Boost.MultiIndex
serialization support is indeed interface intrusive but not
data intrusive; let me explain why I had to do things this way.

No one doubts non-intrusive serialization is the preferred
approach, when feasible. Now consider the process of serializing
a multi_index_container: we naturally want deserialization to
reconstruct the order in which the elements were arranged
in *every* index of the container (hashed indices excluded
from this guarantee, as relying in the arbitrary order they
provide is unsound to begin with). If we follow the natural
approach of saving the sequence of values as traversed by
some of the indices (let's say index #0), this guarantee is
not held, let's see the problems encountered with each index
type:

* Ordered indices: no problems with *unique* ordered indices,
since the order there is strictly determined by the values
contained. But what about non-unique ordered indices? Consider
this container:

  multi_index_container<
    int,
    indexed_by<
      ordered_non_unique<int_modulo_3_extractor>,
      ordered_non_unique<int_modulo_3_extractor>
>
>

where the elements are deemed equivalent if they have
the same value modulo 3. Now suppose we have the values
0,3 and 6 in the container, and index #0 lists them in the
following order:

  0 3 6

What can we say from this info about the traversal when done
through index #1? The answer is: absolutely nothing, whatever
permutation of these elements could be validly exposed by
index #1. These variations from index #0 result from the fact
that the values could have been inserted with hinted insert()
through either of both indices. The final state is a convoluted
function of the insertion history.

* Sequenced and random-access indices: here it is even clearer
that the order maintained by these indices cannot be inferred by
those of other indices, since after insertion elements can be
freely relocated.

So, in order to provide our desired level of functionality
(perfect reconstruction of every index traversal order) we
cannot just save the elements as traversed by index #0, but we
must somehow codify the variations from every other index wrt
to the traversal order implied by index #0, when these
variations are not unique --this is indeed what's done in an
efficient way, using LIS (longest increasing subsequence)
algorithms. So far, this approach is not data intrusive, since
the information stored does not reveal any particular
implementation detail and reflects only the nature of traversal
orders allowable by index semantics. Now the remaining question
is: can we use this info to reconstruct the traversal orders
in a non interface intrusive way, i.e. by resorting only to the
public interface of each index type?

* Ordered indices: No; once an element is inserted into a
multi-index container, there is no way (with public interface
methods) of changing its order with respect to other equivalent
elements in a non-unique ordered index, short of extracting and
reinserting the element again, which, besides being terribly
inefficient, destroys the relative element position gotten so
far in other indices, in a sort of cacth-22 situation.

* Sequenced indices: Yes. splice() (or relocate())
facilities can be used to alter the traversal order of a
sequenced index without touching the rest of indices.

* Random-access indices: Yes, with problems. splice() member
functions are available here as for sequenced indices, but
relocating elements around is a O(n) operation, implying a
O(n*n) complexity for the task of reconstructing the entire
index. An interface intrusive scheme can do this in linear
time.

So, I had no other choice but to implement serialization
support for Boost.MultiIndex in an interface intrusive way.
The morale of the story is: for rich-state classes where the
exact state of an object depends heavily on its past history,
non-intrusive serialization can be either algorithmically
unfeasible (it is hard or impossible to reconstruct the
history from the current state) or potentially less efficient
than a interface intrusive approach. This does not mean that
the class interface or the serialization support implementation
are "broken".

Just my 2c, sorry about the long post.

Joaquín M López Muñoz
Telefónica, Investigación y Desarrollo


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk