Boost logo

Boost :

From: Matthias Troyer (troyer_at_[hidden])
Date: 2005-10-12 07:17:06


On Oct 11, 2005, at 6:45 PM, Robert Ramey wrote:

> Matthias Troyer wrote:
>>
>> There are actually only very few modifications:
>>
>> 1. boost/archive/detail/oserializer.hpp and iserializer.hpp require
>> modifications for the serialization of C-arrays of fixed length. In
>> my version, the class save_array_type is modified to dispatch to
>> save_array when fast_array_serialization is possible. The underlying
>> problem here is that oserializer.hpp implements the serialization of
>> a type here (the C array!). The optimal solution to this problem
>> would be to move the array serialization to a separate header boost/
>> serialization/array.hpp, as is done for all C++ classes.
>>
>
> My intention was to include all types "built-in" to the C++
> language in
> ?serializer.hpp
> so that's why it includes C++ arrays. Separating this into another
> header
> would
> break this "concept" and would result in a small additional *.hpp
> file that
> would
> have to be included explicitly. So I would be against it in this
> case. On
> the
> other hand, I do appreciate ideas that remove pieces from the
> "core" of the
> library
> and turn them into more independent modules so don't get
> discouraged here.

Actually the way I see it you deal with the serialization of
pointers, the class versioning and object tracking there. All of
these get serialized through special types, created by strong
typedefs, and the archives can then override the serialization of
these classes. Please correct me if I'm wrong, but the built-in types
like int, double, etc.. are actually all dispatched to the archive,
and not serialized by the [io]serializer. The only exception seem to
be pointers (which should be handled by the serialization library),
and C-arrays. It would thus make sense to put the serialization of
arrays into a separate header, just as you have done for std::vector,
and as I will do soon for other classes.

However there are also other options as you pointed out: the archive
classes could override the serialization of arrays. As long as this
stay limited to arrays there will be no MxN scaling problem, but
there is still a problem of code duplication, since each archive type
implementing fast array serialization has to override the
serialization of arrays. This is also error-prone since we have to
tell the implementors of archives supporting fast array serialization
that they should not forget overriding serialization of built-in arrays.

>> 2. boost/serialization/vector.hpp is also modified to dispatch to
>> save_array and load_array where possible. I don't think that this is
>> a problem?
>>
>
> That would mean that users are including save/load_array even if they
> don't want them or want to use their own versions. Oh - then
> documentation
> has to be enhanced to explain all this internal behavior.

Actually the cost is minimal if the archive does not support fast
save/load_array. The hasfast_array_serialization.hpp header only
consists of the default traits:

template <class Archive, class Type>
struct has_fast_array_serialization : public mpl::bool_<false> {};

and the serialization of a std:vector only contains this minimal
extension:

template<class Archive, class U, class Allocator>
inline void save(Archive & ar,const STD::vector<U, Allocator> &t,
const unsigned int ,
     typename
boost::enable_if<boost::archive::has_fast_array_serialization<Archive,U>
>::type* =0
){
     const boost::archive::container_size_type count(t.size());
     ar << BOOST_SERIALIZATION_NVP(count);
     if (count)
       ar.save_array(boost::detail::get_data(t),t.size());
}

  The cost of parsing these few lines is negligible compared to the
rest of the serialization library.

> I would prefer
> something like the following:
>
> class my_class {
> stl::vector<int> m_vi;
> ...
> };
> template<class Archive>
> my_class::serialize(Archive &ar, const unsigned int version){
> // standard way
> ar & m_vi;
> // or fast way which defaults to standard way in appropriate
> cases.
> save_array(ar, m_vi);
> }
>
> This
> a) keeps the stl portion of the library smaller
> b) leave the user in control of what's going on
> c) permits development of save/load array to be on an independent
> parallel track
> with everyting else.

I find this proposal unacceptable for the following reasons:

- it breaks the orthogonality between serialization and archives

- how the array representation of the vector gets serialized should
be a concern of the archive and not the user

- the user has to remember to always call save_array(ar,m_vi) instead
of just serializing the vector directly. This is quite error-prone
and will easily lead to sub-optimal code.

- the user has to know for which classes to call save_array and for
which it will not be needed. For vector it might be intuitive, but
what about ublas matrix types: do you know which ublas matrix types
can use fast array serialization and which ones cannot? Or, even
worse, if the matrix type is a template parameter you have a worse
problem.

Now to address your issues:

a) keeping the STL portion small: I don't see this as a valid point
since, as you can see above it increases the size of the STL
serialization code only by a few lines.

b) "leave the user in control of what's going on": actually this is
breaking the orthogonality. The user should not influence the
archives internal behavior. The archive class should decide how to
serialize the array, not the user. The user can pick between fast and
slow array serialization by choosing a different archive class.

c) development on an independent track: the only interference we have
is this one file vector.hpp.

>> All the other changes were to modify the binary archives and the
>> polymorphic archive to support fast array serialization. In contrast
>> to the above points this is optional. Instead we could provide
>> fast_binary_[io]archive and fast_polymporphic_[io]archive, that
>> differ from their normal versions just by supporting fast array
>> serialization. I could live with this as well, although it makes more
>> sense in my opinion to just addd the save_array/load_array features
>> to the existing archives.
>
> sounds like you can go along with me on this.

I'll do that but I still think that it is the wrong way to go for the
binary archives. The only difference
are these five lines in basic_binary_iprimitive.hpp

     template<class T>
     void load_array(T *address, std::size_t count)
     {
         load_binary(address, count*sizeof(T));
     }

but they give you a factor of 10 in performance! Why should anyone
still use the other version? To save the compile time for 5 lines of
code?

> The other part of your proposal really is an enhancement of the stl
> serializations
> which are not part of the "core" part of the library. I'm
> confident that
> your
> enhancement will be quite useful to many people and I'm very happy
> to see
> that
> you've done it. That's not the same as forcing it on all users of the
> library. So
> I prefer to see your enhancement as an option explicitly invoked by
> the
> user.
> This has a number of advantages:
>
> a) the user can see what he is doing. No hidden complex behavior.

the fast array serialization is an implementation detail of the
archive, and need not be visible to the user at all. Note that none
of the archive modifications I suggested change anything that is
visible to the user, aside from a speedup in execution time. The
syntax and semantics of the serialization is unchanged, as is the
archive itself. Only the process of serialization is faster.

> b) some users might want just minimal code since thier collections
> are small

we're talking about a few lines of code that can improve performance
by a factor of 10 in a library that is many thousands of lines. Is
that really an issue?

> c) your enhancement will require significant documentation. This
> will be
> much
> easier if it is an optional add-on to the serialization library.

It requires one or two pages of documentation for archive developers,
and none for users unless they implement serialization of array-like
containers.

> So to summarize the issues
>
> a) should C++ array serialization be in a separate header? I say
> no, you say
> yes.
>
>
> b) should save/load array be incorporated into stl collection
> serialization
> to make
> its usage oblicatory? I say no, you say yes.

This point b) is where I will not budge, for the reasons explained in
earlier e-mails. While I could maybe live with the fact that I have
to override the C-style array serialization in all archives
supporting fast array serialization, I will never do that for other
classes, since this again opens the can of worms discussed
previously. Let me outline it again:

if the vector.hpp serialization stays unchanged, I will have to
override it in the archive.

Next we'll implement the std::valarray serialization. What should we
do? Support fast array serialization out of the box or leave it to
the archive implementor to override. We'll probably follow the
example of std::vector and do not support it. Now the archive also
has to provide overrides for std::valarray, which can still be done.

After that we'll implement serialization of ublas matrices. Following
the above examples we will again not implement support for fast array
serialization directly, to save a few lines of code. The consequence
is even worse now: the archive implementation has to override the
serialization of all ublas matrices, and will either be inefficient,
or has to have knowledge of implementation details of the ublas
matrices.

We would be back at both an MxN problem, and will have tight coupling
between archives and the implementation details of the classes to be
serialized. We should avoid this at all cost!

So the real question here is:

"Shall we recommend that the serialization of array-like data
structures uses fast array serialization by calling save/load_array
when possible?"

My clear answer is yes, and I will not budge on that. The
serialization library is useless to me with a 10x performance hit.
And many people I talked to do not use Boost.Serialization but their
own (otherwise inferior) solutions for that reason. I just want to
mention that vectors with billions elements are typical sizes for
many of our problems.

The real question is where to draw the line between using fast array
serialization and not using it?

- I think we can agree that classes like multi_array or ublas
matrices and vectors should be recommended to use it wherever possible
- The same should be true for std::valarray.
- To stay consistent we should then also use it for std::vector
- What about C-arrays? since I rarely actually use them in raw form
in my code, and never for large sizes, I have no strong personal
preference. It would just be consistent and speed up serialzation at
negligible compile time cost to also use the fast option there, but
if you veto it I could live with it.

>
> Regardless of how these questions are answered, its clear to me
> that your
> enhancements to stl containers will be available to users.
> Assuming that
> this
> is packaged as an optional add-on as I would hope, the only questions
> remaining will
> be:
>
> a) Should this be a totally separate library with its own
> documentation/tests/directory tree etc?
> It should be separate but not totally so:
>
> Maybe a files or maybe directory within serialization for save/load
> array
> and within archive
> for fast...archive adaptor.
>
> A group of its own tests - just like we have tests for all other
> combinations of serializations
> and archives - I can hear the howling already. We'll have to see
> what to do
> about this.
>
> A separate documenation section in the documenation of the
> serialization
> library. Similar to the
> miscelleneas. But miscellaneas holds things that are really
> separate so
> we'll find a good place for it. Maybe a section titled something like
> "Special Considerations When Serializing Collections" (but shorter).
>
> Note that your fast...archive will really be a new creature - an
> "archive
> adaptor". This merits
> an new section in the documentation in any case. This is a cool
> and useful
> technique
> and will encourage future great ideas which enhance the serialization
> library without
> making the core bigger.

Actually it might be an adaptor only in the case of the binary
archive. Other archives I mentioned (such as the MPI archive) will
have to support fast array serialization directly to have any chance
of being usable.

> b) Should such an optional enhancement be subject to some sort of
> review?
> I'm agonostic
> on this. I would be happy to just be the gate keeper and accept it
> or make
> you alter it to my taste.

I have no strong opinion on this. For now I need a mechanism for fast
array serialization in place to be able to implement serialization of
large arrays and matrices. Optimized archives could come later and
could go through some kind of review, e.g. an MPI archive together
with a possible future parallel programming library.

Regarding the binary archives: if your concern is that it will make
it harder for you to maintain, then I could, if you want, propose to
submit the fast array version as a replacement for the existing one
and take over its maintenance. That will make your task smaller and
in the review of the submission we can hear if someone wants to keep
the existing version.

Matthias


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk