
Hello,

I'm getting a segfault when using Boost.MPI on a cluster of heterogeneous machines (x86_64 and ppc64). It happens when the "slave" machine (ppc64) receives its payload from the "master" machine (x86_64) and tries to unpack the archive. Running a debug build under valgrind shows where it goes wrong:

==28632== Invalid write of size 8
==28632==    at 0x10429DDC: boost::archive::detail::basic_iarchive_impl::load_pointer(boost::archive::detail::basic_iarchive&, void*&, boost::archive::detail::basic_pointer_iserializer const*, boost::archive::detail::basic_pointer_iserializer const* (*)(boost::serialization::extended_type_info const&)) (basic_iarchive.cpp:453)
==28632==    by 0x1042772F: boost::archive::detail::basic_iarchive::load_pointer(void*&, boost::archive::detail::basic_pointer_iserializer const*, boost::archive::detail::basic_pointer_iserializer const* (*)(boost::serialization::extended_type_info const&)) (basic_iarchive.cpp:564)
==28632==    by 0x10468707: void boost::archive::detail::load_pointer_type<boost::mpi::packed_iarchive>::invoke<pagmo::population*>(boost::mpi::packed_iarchive&, pagmo::population*&) (iserializer.hpp:518)
==28632==    by 0x104683EF: void boost::archive::load<boost::mpi::packed_iarchive, pagmo::population*>(boost::mpi::packed_iarchive&, pagmo::population*&) (iserializer.hpp:586)
==28632==    by 0x10468223: void boost::archive::detail::common_iarchive<boost::mpi::packed_iarchive>::load_override<pagmo::population*>(pagmo::population*&, int) (common_iarchive.hpp:68)
==28632==    by 0x10468023: void boost::archive::basic_binary_iarchive<boost::mpi::packed_iarchive>::load_override<pagmo::population*>(pagmo::population*&, int) (basic_binary_iarchive.hpp:67)
==28632==    by 0x10467E27: void boost::mpi::packed_iarchive::load_override<pagmo::population*>(pagmo::population*&, int, mpl_::bool_<false>) (packed_iarchive.hpp:98)
==28632==    by 0x10467C27: void boost::mpi::packed_iarchive::load_override<pagmo::population*>(pagmo::population*&, int) (packed_iarchive.hpp:115)
==28632==    by 0x1046798F: boost::mpi::packed_iarchive& boost::archive::detail::interface_iarchive<boost::mpi::packed_iarchive>::operator>><pagmo::population*>(pagmo::population*&) (interface_iarchive.hpp:60)
==28632==    by 0x104676BB: void boost::serialization::nvp<pagmo::population*>::load<boost::mpi::packed_iarchive>(boost::mpi::packed_iarchive&, unsigned int) (nvp.hpp:87)
==28632==    by 0x104674AF: void boost::serialization::access::member_load<boost::mpi::packed_iarchive, boost::serialization::nvp<pagmo::population*> >(boost::mpi::packed_iarchive&, boost::serialization::nvp<pagmo::population*>&, unsigned int) (access.hpp:101)
==28632==    by 0x104672CF: boost::serialization::detail::member_loader<boost::mpi::packed_iarchive, boost::serialization::nvp<pagmo::population*> >::invoke(boost::mpi::packed_iarchive&, boost::serialization::nvp<pagmo::population*>&, unsigned int) (split_member.hpp:54)
==28632==  Address 0x4b65d98 is not stack'd, malloc'd or (recently) free'd

The invalid write happens in the method basic_iarchive_impl::load_pointer, around line 450 of basic_iarchive.cpp:

    int i = cid;
    cobject_id_vector[i].bpis_ptr = bpis_ptr;

A printf confirms that i == 512 while cobject_id_vector.size() == 3, so the index is far out of range. The same bad value also makes the assertion new_cid == cid fail one line below (there new_cid == 2).

The same code runs fine when launched locally on the ppc64 machine, acting as both slave and master, with mpirun -np 2.

Boost version is 1.42.0; the MPI implementation is Open MPI 1.4.2.

Could this be related to an endianness issue? Is Boost.MPI expected to work on heterogeneous clusters?

Thanks,
Francesco.