Boost logo

Boost :

From: Matthias Troyer (troyer_at_[hidden])
Date: 2002-11-24 10:28:50


After spending more time with the serialization library, implementing
an archive
for the XDR format and considering how I could read my old archive
files using this
library instead of my own one, I can now give a more detailed review of
the
serialization library. I will keep those parts that I discussed in more
detail
previously short and focus on some new issues.

To start with a summary, I see problems in the following areas:

1. Compilation problems - minor problems
2. Code quality - solvable problems
3. Does not work on all platforms - solvable problem
4. Interface design: there are some show-stoppers here for now
    a) primitive types: code is not portable at the moment
    b) performance: need improved methods for large data sets
    c) binary archives: design problems in current version
    d) small objects: optimization could easily be achieved by a traits
class
       instead of reimplementing all the std-library serializaing for all
       small object classes individually
    e) serialization template specialization versus free functions?
5. Overall design: I want a better factorization of the library into
    a basic library and separate support for pointers. Also, one needs to
    be able to override archive preamble, and functions called for
    i) new class encountered ii) start/end of an object serialization
    reading the version number, ...

I value all the work that Robert put into this library and am willing
to help
improve it further. During this review I had some fruitful exchange of
comments
with Robert, which solved all of the compilation problem. I thus hope
that it
will be possible to get the library into a better shape soon. For the
moment
my overall vote is however to have another round of review after the
interface
and implementations have been improved. I vote for urging Robert to
improve the
library (and am willing to contribute to the improvements), but have to
vote a clear
NO to this version since I want to have another review of an improved
version.

Below I will elaborate in more details on the issues summarized above:

1. Compilation problems
=======================

Most compilation problems have been fixed in a revised version
serialization7.zip
that Robert sent me last week. I appreciate his efforts in getting
these things
fixed quickly. There is just one minor and one major problem left as
far as I can see: In boost/void_cast.hpp there is a typename missing on
line 148:
   is_polymorphic<typename remove_const<FROM>::type >,
                  ^^^^^^^^

The shared_ptr demo does not compile:

boost/shared_ptr.hpp:297: `boost::detail::shared_count
boost::shared_ptr<A>::pn' is private
demo_shared_ptr.cpp:183: within this context

Why do you want to access a private variable here?

Otherwise I have not encountered any problems

2. Code quality
===============

I do not want to go into details at this stage since I believe that a
redesign
of some aspects of the library is needed first. However code fragments
such as:
line 95-96 of archive.cpp seem unacceptable to me:

        // note breaking a rule here - is this a problem on some platform
        is.read(const_cast<char *>(s.data()), size);

3. Does not work on all platforms
=================================

Running the test program on MacOS X 10.2 (using gcc 3.1) gives:

*** testing native binary archive
../../boost/serialization/archive.hpp:986: failed assertion `is.good()'
Abort

It thus seems there are problems with the binary streams. Note that
under
Darwin (FreeBSD in general?), the binary specification is just ignored
as
can be read on the man page for fopen:

     The mode string can also include the letter ``b'' either as a third
char-
      acter or as a character between the characters in any of the
two-charac-
      ter strings described above. This is strictly for compatibility
with
      ISO/IEC 9899:1990 (``ISO C89'') and has no effect; the ``b'' is
ignored.

These problems need to be sorted out, but I do not consider them urgent
since
a redesign of the binary archives is urgently needed beforehand.

4a. Interface design: primitive types
=====================================

There are two questions here: a discussion which types to use as
primitive types
and problems an incompatibilities of the current version

4a.i) which primitve types to use
---------------------------------

We should discuss whether to use short, int, long ... as the primitive
types
or int8_t, int16_t, int32_t, int64_t. The latter makes it easier to
write
portable archives, the former seems more natural. I can accept both
choices but
we should not mix the two as is done now

4a.ii) problems with the current primitive types
------------------------------------------------

There are two problems:

* A serialization of bool is missing - easy to fix

* The code will not compile on platforms where long is 64-bit:

        virtual basic_oarchive & operator<<(long _Val) = 0;
        virtual basic_oarchive & operator<<(int64_t _Val) = 0;

    The signature of these two functions is identical when int64_t is
typedef-ed
    to be long. See discussion under 4a.i)

4b. Interface design: performance: need improved methods for large data
sets
========================================================================
====

As mentioned in previous posts, additional functions e.g. load_array
and save_array
need to be added to allow efficient serialization of large data sets.
The
default version could just use operator<< or operator>> as in:

virtual void save_array(const int* p, std::size_t n) {
   for (std::size_t i=0;i<n;++i)
     *this << p[i];
}

and this would thus not incur any extra coding work for people not
interested.
Serialization of containers such as std::vector, or ublas or mtl vectors
and matrices can make use of this extra function transparent to the
user so that
the interface would also not become harder to understand for the
library user.

4c. Interface design: binary archives
=====================================

I see two problems that can easily be fixed:

4c.i) Problems with the current version
---------------------------------------

Currently the library wants to protect me from problems with binary
incompatibilities by checking if the sizes of the primitive types are
the same on the platform on which I read or wrote. I believe this to be
a misfeature. Consider two platforms

   A: 32-bit long, 64-bit long long, int64_t typedef-ed as long long
   B: 64-bit long, int64_t typedef-ed as long

Currently when I try to read an archive from platform A on platform B,
the library
aborts because of incompatible primitive types. However, as a power
use, knowing about
portability problems I did not use long or long long in my codes, but
always int64_t,
and should thus have no problems (ignoring byte order issues). The
library should
thus allow me to read the file although the primitive types are
different!

Considering this problem, which we face daily (we transfer files between
Linux PCs, Macintosh, SUN and Alpha workstations, Cray, HP, Hitachi and
Fujitsu
supercomputers) has made us choose int*_t as the primitive types for
serialization
in our library (which is in use since 1994). See 4.a.i).

4c.ii) Change of the binary archive class design
------------------------------------------------

I propose a change to the design of the native binary archives. While
implementing
operator<< to just call write_binary is common to all native binary
archive classes,
the specific implementation of write_binary can be different. I thus
propose to
factor out the operator<< implementations into a new base class
basic_obarchive,
which still contains write_binary as pure virtual function. From this
class we can derive:

stream_obarchive: serializing into a stream as done now
file_obarchive: serialization into a file using fwrite
buffer_obarchive: serialization into a memory buffer using e.g. memcpy
...

and similar for the input archives.

4d. Interface design: small objects
===================================

I have mentioned this in a previous post. Instead of requiring the user
to
reimplement the serialization of standard containers for all small
object types
for which the versioning and pointer system should be bypassed, a
traits class
can be added and the optimized serialization of all containers of small
objects
implemented in the library. Note that the traits class needs to be
specialized
only for those objects for which the user wants to optimize
serialization, while
no effort is required at all if the standard serialization method is to
be used.

4e. Interface design: serialization template specialization versus free
functions?
========================================================================
==========

There are two ways to implement non-intrusive serialization. One is via
specialization of the serialization class, the other via free functions
called
e.g. serialization_{save,load,version}.

I myself would give preference to free functions but can live with
either variant.
The current library is however not consistent since

* serialization of normal classes goes via specialization of the
serialization<T>
   class

* serialization of template classes goes via overloading of the free
function
   serialization_detail::save_template(), ...

This is unacceptable and a consistent method should be found.

5. Overall design:
==================

Finally, before being able to vote yes to an inclusion in boost I would
need to see some changes to the global design of the library.

5.a) Factor out the pointer serialization features
--------------------------------------------------

It should be straightforward to factor out the serialization of pointers
and to split the library into archives for the serialization of basic
types,
and an add-on for the serialization of pointers. That way users who do
not
need the elaborate pointer serialization mechanism can do so and not
incur
the performance penalty (in terms of memory and speed) of to this
feature, while
in cases when it is needed we can add it on to he archive.

5.b) Allow overriding of preamble, etc.
---------------------------------------

I would like to have more control over some aspects that are currently
hardcoded
into the library:

* writing/reading the preamble
* obtaining the version number of a class
* starting/finishing writing an object
* a new type is encountered

The motivation is very simple: We have hundreds of gigabytes of data
lying around
in tens of thousands of files that could easily be read by the
serialization archive
if there were not too small differences:

i) I wrote a different preamble
ii) I only wrote one version number for all classes in the archive
instead of separate
     version numbers for each class
iii) no information was written when a new class was encountered

Since otherwise the interface is nearly identical (many classes contain
a load
and a save function, albeit with a different name for the archive
classes), changing
all my codes over to a boost::serialization library would be easy if it
weren't for
the three issues above.

CONCLUSIONS:
============

Since these are major changes I would like to see a new review after
they
are implemented and thus vote NO for now. However I am willing to help
Robert with
implementing the changes, improving the library, and am willing to
discuss further.

Best regards,

Matthias


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk