
Boost Users :

Subject: [Boost-users] [MPI, serialization] Segmentation fault in heterogeneous cluster
From: Francesco Biscani (bluescarni_at_[hidden])
Date: 2010-09-01 19:39:58


Hello,

I'm getting a segfault when using Boost.MPI on a cluster of
heterogeneous machines (x86_64 and ppc64). It occurs when the
"slave" machine (ppc64) receives its payload from the "master"
machine (x86_64) and tries to unpack the archive. Running a debug
build under valgrind points to the following invalid write:

==28632== Invalid write of size 8
==28632== at 0x10429DDC:
boost::archive::detail::basic_iarchive_impl::load_pointer(boost::archive::detail::basic_iarchive&,
void*&, boost::archive::detail::basic_pointer_iserializer const*,
boost::archive::detail::basic_pointer_iserializer const*
(*)(boost::serialization::extended_type_info const&))
(basic_iarchive.cpp:453)
==28632== by 0x1042772F:
boost::archive::detail::basic_iarchive::load_pointer(void*&,
boost::archive::detail::basic_pointer_iserializer const*,
boost::archive::detail::basic_pointer_iserializer const*
(*)(boost::serialization::extended_type_info const&))
(basic_iarchive.cpp:564)
==28632== by 0x10468707: void
boost::archive::detail::load_pointer_type<boost::mpi::packed_iarchive>::invoke<pagmo::population*>(boost::mpi::packed_iarchive&,
pagmo::population*&) (iserializer.hpp:518)
==28632== by 0x104683EF: void
boost::archive::load<boost::mpi::packed_iarchive,
pagmo::population*>(boost::mpi::packed_iarchive&, pagmo::population*&)
(iserializer.hpp:586)
==28632== by 0x10468223: void
boost::archive::detail::common_iarchive<boost::mpi::packed_iarchive>::load_override<pagmo::population*>(pagmo::population*&,
int) (common_iarchive.hpp:68)
==28632== by 0x10468023: void
boost::archive::basic_binary_iarchive<boost::mpi::packed_iarchive>::load_override<pagmo::population*>(pagmo::population*&,
int) (basic_binary_iarchive.hpp:67)
==28632== by 0x10467E27: void
boost::mpi::packed_iarchive::load_override<pagmo::population*>(pagmo::population*&,
int, mpl_::bool_<false>) (packed_iarchive.hpp:98)
==28632== by 0x10467C27: void
boost::mpi::packed_iarchive::load_override<pagmo::population*>(pagmo::population*&,
int) (packed_iarchive.hpp:115)
==28632== by 0x1046798F: boost::mpi::packed_iarchive&
boost::archive::detail::interface_iarchive<boost::mpi::packed_iarchive>::operator>><pagmo::population*>(pagmo::population*&)
(interface_iarchive.hpp:60)
==28632== by 0x104676BB: void
boost::serialization::nvp<pagmo::population*>::load<boost::mpi::packed_iarchive>(boost::mpi::packed_iarchive&,
unsigned int) (nvp.hpp:87)
==28632== by 0x104674AF: void
boost::serialization::access::member_load<boost::mpi::packed_iarchive,
boost::serialization::nvp<pagmo::population*>
>(boost::mpi::packed_iarchive&,
boost::serialization::nvp<pagmo::population*>&, unsigned int)
(access.hpp:101)
==28632== by 0x104672CF:
boost::serialization::detail::member_loader<boost::mpi::packed_iarchive,
boost::serialization::nvp<pagmo::population*>
>::invoke(boost::mpi::packed_iarchive&,
boost::serialization::nvp<pagmo::population*>&, unsigned int)
(split_member.hpp:54)
==28632== Address 0x4b65d98 is not stack'd, malloc'd or (recently) free'd

The issue is in the method basic_iarchive_impl::load_pointer, around line 450:

int i = cid;
cobject_id_vector[i].bpis_ptr = bpis_ptr;

Indeed, a printf confirms that i == 512 while cobject_id_vector.size()
== 3, so the write is out of bounds. The same value also makes the
assertion new_cid == cid fail one line below (where new_cid == 2). The
same code, run locally on the ppc64 machine acting as both master and
slave with mpirun -np 2, runs fine. The Boost version is 1.42.0 and the
MPI implementation is Open MPI 1.4.2.
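
For what it's worth, the two values are at least consistent with a
byte-order mismatch: 2 is 0x0002 and 512 is 0x0200, i.e. the same two
bytes in the opposite order. The sketch below only illustrates that
arithmetic. It assumes the class id travels as a 16-bit integer (my
reading of class_id_type, not something I have verified in the packed
archive code), and swap_bytes16 is a hypothetical helper, not a Boost
function.

```cpp
#include <cstdint>

// Hypothetical helper: swap the two bytes of a 16-bit value. This models
// what a reader sees when it interprets a 16-bit integer written in the
// opposite byte order without any conversion.
constexpr std::uint16_t swap_bytes16(std::uint16_t v) {
    return static_cast<std::uint16_t>(((v & 0x00FFu) << 8) | ((v & 0xFF00u) >> 8));
}

// new_cid == 2 in the failed assertion; cid (and hence i) == 512 in the printf.
static_assert(swap_bytes16(2) == 512, "2 (0x0002) byte-swapped is 512 (0x0200)");
static_assert(swap_bytes16(512) == 2, "swapping twice restores the original value");
```

If the class id were wider than 16 bits, a plain byte swap of 2 would not
give 512, so the assumption about the width matters here.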

Can this be related to some endianness issue? Is Boost.MPI expected to
work on heterogeneous clusters?

Thanks,

  Francesco.

