From: Matthias Troyer (troyer_at_[hidden])
Date: 2002-11-16 11:38:43

Next message: Gennaro Prota: "[boost] Re: Copy Constructible Concept"
Previous message: Jeff Garland: "RE: [boost] String algorithm library"
In reply to: David Abrahams: "[boost] Reminder: Serialization Library Review"

Before coming to my detailed review I would like to thank Robert for
all his work and for contributing it to Boost.

I want to start with a general comment about text vs. binary archives.
While a text archive is nice to look at, and can also be compressed
after writing to save disk space, there are two important reasons why I
will always use binary formats:

i) when using serialization in passing messages between processes
(using MPI, PVM or another message passing library), I am often
restricted by bandwidth, especially when sending large vectors or
matrices. Then converting numbers to text instead of sending them as
binary numbers will make my codes slow down by a factor of 2-3.

ii) as my simulation programs usually run several weeks, I serialize
the state of the simulation at the end of every batch job (usually
every 24 hours). At that time 128-256 CPUs write about 100 MByte each
to a central file server, which already severely overloads the server.
Thus, again, I want to keep the file sizes as small as possible and
consider support for efficient portable binary serialization formats
essential.

On Friday, November 15, 2002, at 05:13 PM, David Abrahams wrote:
> Here are some questions you might want to answer in your review:
>
> What is your evaluation of the design?

I like the design, as apparently much thought was put into being able
to serialize polymorphic classes.

I see two serious problems, however, that have to be addressed before I
can vote for inclusion into Boost.

1.) The first problem is the set of basic data types used in the archive:

     virtual basic_oarchive & operator<<(signed char _Val) = 0;
     virtual basic_oarchive & operator<<(unsigned char _Val) = 0;
     virtual basic_oarchive & operator<<(char _Val) = 0;
     virtual basic_oarchive & operator<<(short _Val) = 0;
     virtual basic_oarchive & operator<<(unsigned short _Val) = 0;
     virtual basic_oarchive & operator<<(int _Val) = 0;
     virtual basic_oarchive & operator<<(unsigned int _Val) = 0;
     virtual basic_oarchive & operator<<(long _Val) = 0;
     virtual basic_oarchive & operator<<(unsigned long _Val) = 0;
     virtual basic_oarchive & operator<<(float _Val) = 0;
     virtual basic_oarchive & operator<<(double _Val) = 0;
     virtual basic_oarchive & operator<<(long double _Val) = 0;
     #ifndef BOOST_NO_INT64_T
     virtual basic_oarchive & operator<<(int64_t _Val) = 0;
     virtual basic_oarchive & operator<<(uint64_t _Val) = 0;
     #endif

short, int and long have no defined bit size, and can thus never be
used for portable serialization. Imagine I use a platform where long is
64-bit, write it to the archive and then read it again on a platform
where long is 32-bit. This will cause major problems. It also prevents
the use of archive formats that rely on fixed bit sizes (such as XDR or
any other platform-independent binary format). My suggestion thus is to
change the types in these functions to int8_t, int16_t, int32_t, as was
already done for int64_t. That way portable implementations will be
possible.
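
To illustrate why fixed-size types matter for a portable format, here is
a minimal sketch that writes a 32-bit integer as exactly four bytes in
big-endian order (the convention XDR uses), whatever the host's byte
order or sizeof(long). The helpers put_int32 and get_int32 are
hypothetical names and are not part of the reviewed library; the sketch
assumes a compiler that provides <cstdint> (Boost offers the same
typedefs in boost/cstdint.hpp).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: append a 32-bit integer to a byte buffer,
// most significant byte first, independent of the host byte order.
inline void put_int32(std::vector<unsigned char>& buf, std::int32_t value)
{
    const std::uint32_t u = static_cast<std::uint32_t>(value);
    for (int shift = 24; shift >= 0; shift -= 8)
        buf.push_back(static_cast<unsigned char>((u >> shift) & 0xFFu));
}

// Hypothetical helper: read back the four bytes starting at pos.
inline std::int32_t get_int32(const std::vector<unsigned char>& buf,
                              std::size_t pos)
{
    assert(pos + 4 <= buf.size());
    std::uint32_t u = 0;
    for (std::size_t i = 0; i < 4; ++i)
        u = (u << 8) | buf[pos + i];
    // Assumes two's complement, which holds on all common platforms.
    return static_cast<std::int32_t>(u);
}
```

A value written this way occupies the same four bytes on every
platform, which is not something an overload taking long can promise.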

2.) The second big problem is speed and efficiency when serializing
large containers of basic data types, e.g. a vector<double> or ublas
vectors and matrices. In my applications these can easily be hundreds
of megabytes in size. In the current implementation, serializing a
std::vector<double>(10000000) requires ten million virtual function
calls. In order to prevent this, I propose to add extra virtual
functions (like the operator<< above), which serialize C-arrays of
basic data types, i.e. functions like

virtual void basic_oarchive::save_array(const int32_t*, std::size_t n)

which by default just call the operator<< n times, but which can be
overridden in specialized archive types. Examples are serialization
into a PVM buffer, where the pvm_pkint function accepts any array of
integers, or the use of memcpy to copy the array in native binary
format into a buffer, or binary write functions into a file. Having
these extra functions allows implementors of archives to make use of
fast functions for arrays of data.

In conjunction with this, the serialization for std::vector and for
ublas vectors, etc. has to be adapted to make use of these optimized
serialization functions for basic data types.
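
The proposal in point 2 could be sketched roughly as follows. This is
only an illustration under stated assumptions: the class names
basic_oarchive_sketch and memory_oarchive_sketch are hypothetical, only
int32_t is shown, and the in-memory archive writes native (non-portable)
binary with memcpy as one of the examples mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical, heavily simplified archive interface.
class basic_oarchive_sketch {
public:
    virtual ~basic_oarchive_sketch() {}
    virtual basic_oarchive_sketch& operator<<(std::int32_t value) = 0;

    // Default: fall back to one virtual operator<< call per element.
    virtual void save_array(const std::int32_t* data, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            *this << data[i];
    }
};

// An archive that copies native binary data into a memory buffer and
// overrides save_array so that a whole array costs a single memcpy.
class memory_oarchive_sketch : public basic_oarchive_sketch {
public:
    basic_oarchive_sketch& operator<<(std::int32_t value) override
    {
        save_array(&value, 1);
        return *this;
    }

    void save_array(const std::int32_t* data, std::size_t n) override
    {
        if (n == 0) return;
        assert(data != nullptr);
        const std::size_t old_size = buffer_.size();
        buffer_.resize(old_size + n * sizeof(std::int32_t));
        std::memcpy(&buffer_[old_size], data, n * sizeof(std::int32_t));
    }

    const std::vector<unsigned char>& buffer() const { return buffer_; }

private:
    std::vector<unsigned char> buffer_;
};

// Container serialization forwards the contiguous storage in one call.
// The element count is narrowed to int32_t only to keep the sketch short.
inline void save(basic_oarchive_sketch& ar,
                 const std::vector<std::int32_t>& v)
{
    ar << static_cast<std::int32_t>(v.size());
    if (!v.empty())
        ar.save_array(v.data(), v.size());
}
```

With this arrangement an archive that has nothing better to offer
inherits the element-by-element default, while a PVM, memcpy or binary
file archive replaces n virtual calls with one.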

3.) While I consider the two issues above show stoppers for the use of
the library in serious scientific simulations with large data sets, the
next issue is not as serious but would be easy to address. It concerns
the serialization of very large numbers of small objects. The current
library shows a way to optimize this (in reference.html#large), but it
is rather cumbersome. As it is now, I have to reimplement the
serialization of std::vector<T>, or std::list<T>, etc., for all such
types T. In almost all of my codes I have a large number of small
objects of various types for which I know that I will never serialize a
pointer. I would thus propose the following:

i) add a traits class to specify whether a pointer to an object will
ever be serialized, or whether it should be treated as a small object
for which serialization should be optimized

ii) specialize the serialization of the standard library containers for
these small objects, using the mechanism in the documentation.

That way I just need to specify a trait for my object and it will be
serialized efficiently.
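
A rough sketch of how such a trait could drive the container code is
given below. Everything in it is hypothetical: the trait name
is_small_object, the example types spin and particle, and the
trace_archive, which merely counts which path each element takes instead
of writing any data. It assumes a compiler with <type_traits>.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical trait: by default a type is handled in the general way,
// i.e. its objects are tracked so pointers to them can be serialized.
template <class T>
struct is_small_object : std::false_type {};

struct particle { double x; };        // general case, no trait given
struct spin { signed char value; };   // never serialized via pointer

// The only thing a user writes to opt in to the optimized path.
template <>
struct is_small_object<spin> : std::true_type {};

// Stand-in archive that records which path was taken per element.
struct trace_archive {
    std::size_t tracked = 0;
    std::size_t untracked = 0;
};

// Optimized path: no per-object bookkeeping.
template <class T>
void save_element(trace_archive& ar, const T&, std::true_type)
{
    ++ar.untracked;
}

// General path: the object would be registered for pointer tracking.
template <class T>
void save_element(trace_archive& ar, const T&, std::false_type)
{
    ++ar.tracked;
}

// One container implementation serves all element types; the trait
// selects the path at compile time.
template <class T>
void save(trace_archive& ar, const std::vector<T>& v)
{
    for (const T& x : v)
        save_element(ar, x, typename is_small_object<T>::type());
}
```

The point of the design is that std::vector<T>, std::list<T> and so on
are specialized once inside the library, and user code shrinks to the
single trait specialization.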

4.) I am confused about registering polymorphic types. If one program
reads an archive written by another program, do both have to register
all the types in exactly the same order, or is it OK if the program
reading the archive registers only a subset of types and in another
order? I need that when an evaluation program reads only the first part
of a file (e.g. only the base class), without reading the rest of the
serialized data of the derived class. Can I read the base class from an
archive into which I serialized the derived class?
This is important for programs which just act on the information in the
base class.

5.) This is a point for discussion and not a criticism of the library.
Instead of polluting the global namespace with a serialization class, I
would prefer to implement serialization with free functions save and
load.
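
For discussion, the free-function style could look roughly like this.
The namespace myapp and the type point are hypothetical, and standard
string streams stand in for real archive classes purely to keep the
sketch self-contained.

```cpp
#include <cassert>
#include <sstream>

namespace myapp {

struct point {
    double x;
    double y;
};

// Free functions living next to the type they serialize. Nothing is
// added to the global namespace; unqualified calls find them through
// argument-dependent lookup.
template <class Archive>
void save(Archive& ar, const point& p)
{
    ar << p.x << ' ' << p.y << ' ';
}

template <class Archive>
void load(Archive& ar, point& p)
{
    ar >> p.x >> p.y;
}

} // namespace myapp
```

Library code would then simply call save(ar, obj) or load(ar, obj)
without any qualification and let the compiler pick the overload that
sits in the namespace of obj.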

6.) Finally, if I am correctly informed, the Java language includes
serialization and has a portable archive format. Could this library be
made compatible with this Java language standard, i.e. might it be
possible to create an archive format which can read such Java
serialization files?

>
> What is your evaluation of the implementation?

I would like to see a platform-independent binary archive format (e.g.
using XDR), but am also willing to contribute that myself once the
interface has been finalized.

>
> What is your evaluation of the documentation?

As was already remarked by others, I would like to see documentation on
exactly which functions a new archive type has to implement. Also, it
is unclear (see point 4 above) whether the registration of types has to
be identical in all programs accessing the same serialized data.

>
> What is your evaluation of the potential usefulness of the
> library?

Extremely useful once the issues above have been sorted out.

>
> Did you try to use the library? With what compiler? Did you
> have any problems?

I tried to use the library but could not compile it under MacOS X 10.2
with gcc 3.1.
Compiling the file "demo.cpp" gives me the error:

../../boost/serialization/serialization_imp.hpp:382: sorry, not implemented: `tree_list' not supported by dump_expr

Thus, unfortunately, I could not do detailed tests of speed and file
sizes.

>
> How much effort did you put into your evaluation? A glance? A
> quick reading? In-depth study?

half a day now and more time with previous versions.

>
> Are you knowledgeable about the problem domain?

Yes, I implemented my own serialization library eight years ago
and used it for many years.

> And finally, every review should answer this question:
>
> Do you think the library should be accepted as a Boost library?
> Be sure to say this explicitly so that your other comments don't
> obscure your overall opinion.

Overall I like the library, and believe that it will not be hard to
address the issues 1-2 above which I consider show stoppers, and issues
3-4 which I consider serious.
I will vote yes if these issues can be resolved.

Robert, many thanks for your efforts - I would love to use the library
in my programs once it is suitable.

Best regards,

Matthias
