From: Matthias Troyer (troyer_at_[hidden])
Date: 2002-11-17 04:19:23


On Sunday, November 17, 2002, at 07:22 AM, Robert Ramey wrote:

>> From: Matthias Troyer <troyer_at_[hidden]>
>
>> Imagine I use a platform where long is
>> 64-bit, write it to the archive and then read it again on a platform
>> where long is 32-bit. This will cause major problems.
>
> Suppose you have a number on the first platform that exceeds 32
> significant bits. What happens when the number is loaded onto
> the second platform? Are the high order bits truncated? How
> do you address this problem now? If none of your longs
> are larger than 32 significant bits then there is no problem.
> If some are, the 32-bit machine can't represent them.
> This can't cause any problems you don't have already.

It can cause trouble, since in my portable codes I use int64_t or
int32_t precisely to be portable. For the library to write numbers in
binary consistently, it should also serialize them as 64-bit or 32-bit.
How do you do that when the bit size can vary from platform to
platform? Do you check at runtime what the number of bits is and
dispatch to the serialization for that number of bits?

No, it seems that in the binary file you just write out the sizes of
the integers and fail the load if the bit sizes don't agree. Using
the fixed-bit-size integers instead would make your binary files much
more portable.
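
To illustrate the point, here is a minimal sketch (my own code, not
the library's interface) of why fixed-width types give a portable
representation: the number of bytes written no longer depends on
sizeof(long):

#include <boost/cstdint.hpp>
#include <ostream>

void save_int32(std::ostream& os, boost::int32_t x)
{
    // writes exactly 4 bytes on every platform; the byte order would
    // still have to be fixed separately, e.g. by an XDR conversion
    os.write(reinterpret_cast<const char*>(&x), sizeof(boost::int32_t));
}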

>
>> It also prevents the use of archive formats that rely on fixed bit
>> sizes (such as XDR or any other platform-independent binary format).
>> My suggestion thus is to change the types in these functions to
>> int8_t, int16_t, int32_t, as was already done for int64_t. That way
>> portable implementations will be possible.
>
> I believe that you could just typedef the above on both platforms and
> use a text archive and everything would work just fine. The text
> archive represents all numbers as arbitrary length integers which
> would be converted correctly on save as well as load.

As I mentioned in the introductory part of my post, text archives are
much larger than binary ones and thus cause bandwidth problems for some
applications. Note that the option of compressing the archive after
writing works a) only if you serialize into files (which is only one
use case) and b) does not address the bandwidth problem of first
writing the large text files.
>
>> 2.) The second problem is speed when serializing large containers of
>> basic data types, e.g. a vector<double> or ublas vectors and matrices.
>> In my applications these can easily be hundreds of megabytes in size.
>> In the current implementation, serializing a
>> std::vector<double>(10000000) requires ten million virtual function
>> calls. In order to prevent this, I propose to add extra virtual
>> functions (like the operator<< above), which serialize C-arrays of
>> basic data types, i.e. functions like
>
> Serialization version 6 which was submitted for review includes
> serialization of C-arrays. It is documented in the reference
> under the title "Serialization Implementations included in the Library"
> and a test case was added to test.cpp.

Yes, but it does so by calling the virtual operator << for each
element, which is very slow if you call it millions of times.
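
To make the proposal concrete, a rough sketch (the name save_array and
the exact interface of basic_oarchive are my assumptions, not the
submitted code):

#include <cstddef>
#include <vector>

class basic_oarchive
{
public:
    virtual ~basic_oarchive() {}
    virtual basic_oarchive& operator<<(double t) = 0; // as today
    // proposed: one virtual call for a whole contiguous array
    virtual void save_array(const double* p, std::size_t n) = 0;
    // ... and similarly for float, the integer types, etc.
};

template <class A>
void save(basic_oarchive& ar, const std::vector<double, A>& v)
{
    if (!v.empty())
        ar.save_array(&v[0], v.size()); // one virtual call in total
}

A native binary archive could then implement save_array with a single
stream write, and an XDR archive with a single xdr_vector call.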

>
>> In conjunction with this, the serialization for std::vector and for
>> ublas vectors, etc. has to be adapted to make use of these optimized
>> serialization functions for basic data types.
>
> The library permits override of the included implementations.
> Of course, this has to be up to the person who finds the
> included implementation inconvenient in some way as he is
> the only one who knows what he wants changed.

That will not work, since overriding is a compile-time decision, while
I decide the archive format at runtime and thus need these optimized
functions available as virtual functions.
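
To spell out why, a sketch with hypothetical class names, following the
pattern of the derived archives in the package: the concrete archive is
only chosen at runtime, so the code serializing a container sees
nothing but a basic_oarchive&, and any optimized bulk write must
therefore be part of the virtual interface:

#include <iostream>
#include <memory>

class basic_oarchive { public: virtual ~basic_oarchive() {} };
class text_oarchive : public basic_oarchive {
public: text_oarchive(std::ostream&) {}
};
class binary_oarchive : public basic_oarchive {
public: binary_oarchive(std::ostream&) {}
};

std::auto_ptr<basic_oarchive> make_archive(std::ostream& os, bool binary)
{
    if (binary)
        return std::auto_ptr<basic_oarchive>(new binary_oarchive(os));
    return std::auto_ptr<basic_oarchive>(new text_oarchive(os));
}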

>
>> the serialization of very large numbers of small objects. The current
>> library shows a way to optimize this (in reference.html#large), but it
>> is rather cumbersome. As it is now, I have to reimplement the
>> serialization of std::vector<T>, or std::list<T>, etc., for all such
>> types T. In almost all of my codes I have a large number of small
>> objects of various types for which I know that I will never serialize
>> a pointer. I would thus propose the following:
>
>> i) add a traits class to specify whether a pointer to an object will
>> ever be serialized, or whether it should be treated as a small object
>> for which serialization should be optimized
>
>> ii) specialize the serialization of the standard library containers
>> for these small objects, using the mechanism in the documentation.
>
>> That way I just need to specify a trait for my object and it will be
>> serialized efficiently.
>
> I would be loath to implement this idea. Basically, instead of
> overloading the serializations that you want to speed up, you want
> to require all of us to specify traits for every class we want to
> serialize.

No, the user who does not care about it need not change anything in his
code at all!

We can have a general template that defaults to the full,
non-optimized serialization method for all classes for which it has not
been specialized. That means no extra code for the standard user, while
the user who needs to optimize large collections of small objects would
just provide a traits specialization, instead of reimplementing the
serialization of all the standard containers for all of his classes
that need to be optimized. An example could be:

template <class T>
struct serialization_traits {
  static const bool optimize_serialization=false;
};

Thus the trait is already defined for all classes that do not need to
be optimized. Only for the classes that I need to optimize would I
write:

template <> struct serialization_traits<MySmallClass> {
  static const bool optimize_serialization=true;
};

and the operator << would dispatch based on the value of this trait,
somewhat like this:

template <class T, class A>
basic_oarchive& operator<<(basic_oarchive& a, const std::vector<T,A>& v)
{
   return dispatch_serialization<
       serialization_traits<T>::optimize_serialization>::serialize(a,v);
}
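
For completeness, a possible dispatcher (a sketch only, untested, and
assuming bulk save_array functions like the ones proposed above):

template <bool Optimize> struct dispatch_serialization;

// default path: element by element, with the full serialization logic
template <> struct dispatch_serialization<false>
{
    template <class T, class A>
    static basic_oarchive& serialize(basic_oarchive& a,
                                     const std::vector<T,A>& v)
    {
        for (std::size_t i = 0; i != v.size(); ++i)
            a << v[i];
        return a;
    }
};

// fast path for small types: write the whole block with one call
template <> struct dispatch_serialization<true>
{
    template <class T, class A>
    static basic_oarchive& serialize(basic_oarchive& a,
                                     const std::vector<T,A>& v)
    {
        if (!v.empty())
            a.save_array(&v[0], v.size());
        return a;
    }
};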

> It would make things harder to use. Also, the current implementation
> - like much boost code - stretches current compilers to the breaking
> point. It's already much more complex to implement than I expected
> and I already have much difficulty accommodating all the differences
> in C++ implementations.

This would keep things as easy to use, since no extra coding is
required for those who do not care about the optimization, but life
would be MUCH easier for the other users. If you like, I can try to
find time to implement this in the library. Also, since this uses just
simple template specialization, no modern compiler should have more
problems with it than it already has with the library.

> Java has runtime reflection which is used to
> obtain all the information required for serialization. Also,
> Java has a much more limited usage of pointers so certain
> problems we are dealing with don't come up. I don't believe
> that all the data structures can be unambiguously mapped
> to Java.

Could Java data structures be mapped to C++ then, to be able to read
Java-serialized files? But that is probably outside the scope of this
library anyway, though it might be interesting as a later extension.

>> I would like to see a platform-independent binary archive format (e.g.
>> using XDR), but am also willing to contribute that myself once the
>> interface has been finalized.
>
> Thank you. Note that none of the comments made so far have any
> impact on the interfaces defined by the base classes
> basic_[i|o]archive,

except that I prefer int16_t, int32_t, ... instead of short and long.

> So there is no reason you can't get started now. As you can see
> from the 3 derivations included in the package, making your own
> XDRarchive is a pretty simple proposition.

I'll do that once I get the library to compile, and will send it to you.

>> As was already remarked by others, I would like to see documentation
>> on exactly which functions a new archive type has to implement.
>
> Wouldn't it be easier just to look at the basic_[i|o]archive code?

I like the documentation to be self-contained. A documentation page
including a synopsis of basic_[i|o]archive, and showing which functions
to implement, would be easier than scanning through the header file,
past all the pragmas, comments and other classes, until one finds the
class definition.

> Perhaps
> we might want to break out text_archive and native binary archive
> into separate headers. That might make it more obvious that
> these derivations aren't really part of the library but rather more
> like popular examples.

That makes sense.

I thank you for your effort in replying in such a detailed manner to my
comments and want to quickly summarize the open issues:

i) as you already use int64_t for 64-bit integers, why not also use
int32_t, int16_t, etc.? That would be more consistent and would make it
MUCH easier to implement portable binary formats!

ii) support for optimizing serialization via a traits class would be
extremely important and helpful, without incurring any extra coding
effort for standard users!

iii) additional virtual functions to serialize large arrays of data
(e.g. dense vectors and matrices), instead of calling operator << for
each of the (possibly millions of) elements, are still needed for
optimization and to make use of the corresponding functions of some
binary serialization formats (e.g. XDR or PVM)

I volunteer to implement ii) and iii) in the library if you agree and
do not want to do it yourself.

With best regards,

Matthias

