From: Ian McCulloch (ianmcc_at_[hidden])
Date: 2004-05-05 12:18:03


Matthias Troyer wrote:
[...]
> As I see it the current serialization library allows both options,
> depending on your preferences. Any archive may choose which types it
> views as fundamental, but both have their disadvantages:
>
> * serializing int and long always as 32 bit and long long as 64 bit has
> the following problems:
> - on 64-bit architectures a long can be 64 bit, and the non-standard
> long long might not be supported by the compiler
> - serializing the size of a container as a 32 bit signed integer will
> prohibit you from serializing containers with more than 2^31 entries.
> Note that we already encounter vectors with larger sizes in some of our
> codes.
> - serializing std::size_t with values larger than 2^32 might not be
> possible at all in a portable way.

Yes. One possibility would be to serialize everything at the largest
plausible width (say 64 bits), and fail at runtime if we try to read
something from the stream that would overflow. This is about as portable
as we can get, I think: most/all platforms of interest have a 64-bit type,
and a vector with more entries than will fit in a size_t cannot be loaded
no matter what binary format we use. This would be unacceptably slow,
though, in some important circumstances (say, MPI on an SMP machine or a
fast network), and it would produce a pessimistically big archive. OTOH,
I'm not sure that high-performance MPI is on the radar for
boost::serialization...
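
A minimal sketch of the "always write 64 bits, check on read" idea, assuming
C++11 and unsigned destination types. The names put_u64, get_u64 and
narrow_checked are hypothetical helpers made up for illustration; they are
not part of boost::serialization.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper: append v as 8 bytes, least significant byte first.
inline void put_u64(std::vector<unsigned char>& out, std::uint64_t v) {
    for (int i = 0; i < 8; ++i)
        out.push_back(static_cast<unsigned char>((v >> (8 * i)) & 0xFF));
}

// Hypothetical helper: read 8 little-endian bytes at pos and advance pos.
inline std::uint64_t get_u64(const std::vector<unsigned char>& in,
                             std::size_t& pos) {
    if (pos > in.size() || in.size() - pos < 8)
        throw std::runtime_error("truncated input");
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i)
        v |= static_cast<std::uint64_t>(in[pos + i]) << (8 * i);
    pos += 8;
    return v;
}

// Narrow a stored 64-bit value to the unsigned type T that the reading
// platform actually uses, throwing if the value does not fit.
template <class T>
T narrow_checked(std::uint64_t v) {
    if (v > static_cast<std::uint64_t>(std::numeric_limits<T>::max()))
        throw std::overflow_error("value does not fit in destination type");
    return static_cast<T>(v);
}
```

Every integer costs 8 bytes in the stream here regardless of its value, which
is the size penalty mentioned above.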

In my own codes, the format of a serialized object defaults to whatever is
closest to the 'native' format; LE_LP32 on x86 and LE_LP64 on alpha (and
presumably x86-64). It would be straightforward to add formats for other
common platforms. In principle, if a calculation running on an x86 machine
fails because it tries to expand a container beyond 2^32 (or likely, 2^31)
entries, then it would be possible to take the last checkpoint file and
continue the calculation on a 64-bit machine, where there would be no such
limitation. The later checkpoint files would contain objects serialized in
LE_LP64 format. Trying to restart those checkpoints on a 32-bit machine
would cause a boost::numeric_cast<> to fail, but only if there are any
(64-bit) size_t records in the stream that are larger than 2^32. I don't
regard this situation as substantially different from, say, trying to read
an archive onto a machine that doesn't have enough memory. If there are no
such overflows, then it would run with no problems.
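
The restart scenario can be sketched roughly as follows, assuming C++11. Only
the format names LE_LP32 and LE_LP64 come from the description above; the
Format enum, size_record_width, write_size and read_size are hypothetical
names, and a plain range check stands in for boost::numeric_cast<> so that
the sketch needs no Boost headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Format tags named as in the text; the rest of this sketch is hypothetical.
enum class Format { LE_LP32, LE_LP64 };

// Width in bytes of a size_t record under each format.
inline std::size_t size_record_width(Format f) {
    return f == Format::LE_LP32 ? 4 : 8;
}

// Write n as a little-endian size record of the format's width.
// Throws if n cannot be represented in that width.
inline void write_size(std::vector<unsigned char>& out, Format f,
                       std::uint64_t n) {
    const std::size_t w = size_record_width(f);
    if (w < 8 && (n >> (8 * w)) != 0)
        throw std::overflow_error("size does not fit in this archive format");
    for (std::size_t i = 0; i < w; ++i)
        out.push_back(static_cast<unsigned char>((n >> (8 * i)) & 0xFF));
}

// Read a size record into HostSize, the unsigned size type of the reading
// machine (std::size_t by default). Throws only if this particular value
// is too large for the host, not merely because the record is wider.
template <class HostSize = std::size_t>
HostSize read_size(const std::vector<unsigned char>& in, std::size_t& pos,
                   Format f) {
    const std::size_t w = size_record_width(f);
    if (pos > in.size() || in.size() - pos < w)
        throw std::runtime_error("truncated archive");
    std::uint64_t n = 0;
    for (std::size_t i = 0; i < w; ++i)
        n |= static_cast<std::uint64_t>(in[pos + i]) << (8 * i);
    pos += w;
    if (n > std::numeric_limits<HostSize>::max())
        throw std::overflow_error("size record too large for this host");
    return static_cast<HostSize>(n);
}
```

Instantiating read_size<std::uint32_t> on an LE_LP64 stream imitates a 32-bit
host reading a checkpoint written on a 64-bit one: small sizes come through
unchanged, and only a record that needs more than 32 bits throws.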

>
> * serializing int32_t and int64_t as the basic types causes other
> problems as you stated:
> - the serialization of int, short and long becomes non-portable since
> they might be int16_t, int32_t or an int64_t depending on the platform.
>
> Whatever choice we pick there will thus be issues that one has to be
> aware of and one has to be careful in the choice of fundamental types
> one serializes. This is no problem as long as the application
> programmer has full control over the types. In serializing the standard
> containers this is however NOT the case since there the size is
> serialized as an int, which will not work for containers with more than
> 2^31 entries. Thus one will either have to reimplement these
> serialization functions, or be able to specify, e.g. by traits, which
> type should be used to serialize the size of a container.

It's not so much the application programmer that has control over the types;
rather, the archive designer has to dictate to the application programmer
what types can be serialized.

For sure, the serialization library should allow some choice for the
representation of size_t, since using an int rules out serializing large
containers on a 64-bit machine.
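
One way the traits idea might look, as a sketch only and assuming C++11:
container_size_type, narrow_archive, wide_archive and save_size are all
hypothetical names invented here, not the library's interface. The default
is a 32-bit signed size, standing in for the int described above, and an
archive can opt in to a wider type by specializing the trait.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical trait: the integer type an archive uses for container sizes.
// The default is a 32-bit signed integer.
template <class Archive>
struct container_size_type { using type = std::int32_t; };

// Two hypothetical archives that just collect bytes.
struct narrow_archive { std::vector<unsigned char> bytes; };
struct wide_archive   { std::vector<unsigned char> bytes; };

// The wide archive opts in to 64-bit sizes.
template <>
struct container_size_type<wide_archive> { using type = std::uint64_t; };

// Write the element count of v, little-endian, at the width chosen by the
// archive's trait. Throws if the count does not fit in that type.
template <class Archive, class T>
void save_size(Archive& ar, const std::vector<T>& v) {
    using size_type = typename container_size_type<Archive>::type;
    const std::uint64_t n = v.size();
    if (n > static_cast<std::uint64_t>(std::numeric_limits<size_type>::max()))
        throw std::overflow_error("container too large for this archive");
    for (std::size_t i = 0; i < sizeof(size_type); ++i)
        ar.bytes.push_back(static_cast<unsigned char>((n >> (8 * i)) & 0xFF));
}
```

With this arrangement the container serialization code is written once, and
the choice of size representation stays with the archive designer.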

Cheers,
Ian

