I've re-run your benchmarks (see attachments) using:
mpicxx -O3 -march=native -mtune=native -fstrict-aliasing -fomit-frame-pointer -pipe -ffast-math -std=c++1y -pedantic -Wall vector_send.cpp; mpirun -np 2 ./a.out
The compiler is clang tip-of-trunk (3.5, svn, 21 January), libc++ is also tip-of-trunk, and Boost is 1.55. I've increased the number of iterations per vector size to 1000.
First of all:
- MPI measurements on my system have an uncertainty similar to the difference between vector and array without allocations.
- Memory allocation measurements on my system have an uncertainty similar to the difference between vector and array with allocations, plus or minus the MPI measurement uncertainty.
Now a lot of guessing follows. When you use comm.recv(1, 0, vector), I guess that the vector is serialized and sent as follows:
- process 1:
  - allocates a buffer to serialize the vector
  - copies the vector into the buffer
  - sends the size of the buffer to process 0
  - sends the buffer to process 0
- process 0:
  - receives the size of the buffer
  - allocates a buffer to receive the vector
  - receives the vector into that buffer
  - copies the buffer's data back into the original vector (which might incur multiple memory allocations)
I've tried to replicate this behavior in the "vector_inefficient" benchmark which seems to agree well with the "vector" benchmark.
Can anyone with more knowledge of the Boost.MPI internals either confirm this or explain what is really happening?
My best guess is that the performance problem comes from relying on the generic boost::serialization for sending/receiving std::vectors. For std::vector, though, we can do the send with the pair (data(), size()) without allocating an extra buffer, and the receive with resize(size) followed by using data() as the receive buffer. I guess that the skeletons already do this.
Note 1: none of the above would be a problem if we were sending a std::map or any other dynamic data structure.
Note 2: vector::resize default-constructs the elements, while malloc does not (new does). That is, even in the best case, sending a std::vector with MPI is still more expensive than sending a plain C array, although for types that are cheap to default-construct the difference should be minimal.
I don't know the best way to improve this. Boost.Serialization does not seem to be the right place for this optimization. Maybe overloading the send and recv methods for std and Boost vectors would work.
Bests,
Gonzalo