Boost Users :
Subject: Re: [Boost-users] [MPI, serialization] Segmentation fault in heterogeneous cluster
From: Pfligersdorffer, Christian (Christian.Pfligersdorffer_at_[hidden])
Date: 2010-09-02 03:12:50
Hi Francesco!
Binary archives are meant for use on a single platform only. If you want to
move archives between different platforms, you have to use something
portable, such as XML or text archives. I guess x64 and ppc64 have
different endianness, and your compilers might use different type sizes
for int as well.
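
To illustrate the endianness guess, here is a minimal sketch in plain C++ with no Boost involved. It assumes the class id travels as a 16-bit integer in the sender's native byte order. The function names are made up for this example. Under that assumption, a value of 2 written on a little-endian machine and read on a big-endian one comes out as 512, which matches the numbers in the report quoted below.

```cpp
#include <cstdint>

// Hypothetical helper: store a 16-bit value the way a little-endian
// host (e.g. x86_64) lays it out in memory, low byte first.
void write_little_endian16(std::uint16_t v, unsigned char* p) {
    p[0] = static_cast<unsigned char>(v & 0xFF);
    p[1] = static_cast<unsigned char>(v >> 8);
}

// Hypothetical helper: interpret two raw bytes the way a big-endian
// host (e.g. ppc64) would, high byte first.
std::uint16_t read_big_endian16(const unsigned char* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// What a big-endian receiver sees if the sender wrote native
// little-endian bytes and nobody converted them.
std::uint16_t misread_on_big_endian(std::uint16_t v) {
    unsigned char buf[2];
    write_little_endian16(v, buf);
    return read_big_endian16(buf);
}
```

Calling misread_on_big_endian(2) yields 512, so an index that should be 2 ends up far outside a vector of size 3.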
You can also have a look at my portable binary archive, which you can
find in the Boost Vault. If you try it, let me know whether it works in
your case.
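
For what it is worth, the general idea behind any portable binary format can be sketched as follows: use fixed-width integer types and always write the bytes in one agreed order, independent of the host. This is only an illustration of the technique, not the actual implementation of the archive in the Vault. The names put_u32 and get_u32 are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 32-bit value to the buffer, most significant byte first.
// The shifts operate on the value, not on memory, so the result is
// the same on little-endian and big-endian hosts.
void put_u32(std::vector<unsigned char>& out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<unsigned char>((v >> shift) & 0xFF));
    }
}

// Read a 32-bit value back from position pos, again assuming the
// most significant byte comes first. at() throws on a short buffer.
std::uint32_t get_u32(const std::vector<unsigned char>& in, std::size_t pos) {
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        v = (v << 8) | in.at(pos + i);
    }
    return v;
}
```

Because both sides agree on the width (std::uint32_t) and on the byte order, the value round-trips no matter which machine wrote it.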
Greetings,
--
Christian Pfligersdorffer
Software Engineering
http://www.eos.info

boost-users-bounces_at_[hidden] on :
> Hello,
>
> I'm getting a segfault when using Boost.MPI on a cluster of
> heterogeneous machines (x86_64 and ppc64). The problem arises
> when the "slave" machine, ppc64, receives its payload from
> the "master" machine, x86_64, and tries to unpack the archive.
> Tracing down the issue with valgrind and in debug mode, the
> problem arises here:
>
> ==28632== Invalid write of size 8
> ==28632==    at 0x10429DDC: boost::archive::detail::basic_iarchive_impl::load_pointer(boost::archive::detail::basic_iarchive&, void*&, boost::archive::detail::basic_pointer_iserializer const*, boost::archive::detail::basic_pointer_iserializer const* (*)(boost::serialization::extended_type_info const&)) (basic_iarchive.cpp:453)
> ==28632==    by 0x1042772F: boost::archive::detail::basic_iarchive::load_pointer(void*&, boost::archive::detail::basic_pointer_iserializer const*, boost::archive::detail::basic_pointer_iserializer const* (*)(boost::serialization::extended_type_info const&)) (basic_iarchive.cpp:564)
> ==28632==    by 0x10468707: void boost::archive::detail::load_pointer_type<boost::mpi::packed_iarchive>::invoke<pagmo::population*>(boost::mpi::packed_iarchive&, pagmo::population*&) (iserializer.hpp:518)
> ==28632==    by 0x104683EF: void boost::archive::load<boost::mpi::packed_iarchive, pagmo::population*>(boost::mpi::packed_iarchive&, pagmo::population*&) (iserializer.hpp:586)
> ==28632==    by 0x10468223: void boost::archive::detail::common_iarchive<boost::mpi::packed_iarchive>::load_override<pagmo::population*>(pagmo::population*&, int) (common_iarchive.hpp:68)
> ==28632==    by 0x10468023: void boost::archive::basic_binary_iarchive<boost::mpi::packed_iarchive>::load_override<pagmo::population*>(pagmo::population*&, int) (basic_binary_iarchive.hpp:67)
> ==28632==    by 0x10467E27: void boost::mpi::packed_iarchive::load_override<pagmo::population*>(pagmo::population*&, int, mpl_::bool_<false>) (packed_iarchive.hpp:98)
> ==28632==    by 0x10467C27: void boost::mpi::packed_iarchive::load_override<pagmo::population*>(pagmo::population*&, int) (packed_iarchive.hpp:115)
> ==28632==    by 0x1046798F: boost::mpi::packed_iarchive& boost::archive::detail::interface_iarchive<boost::mpi::packed_iarchive>::operator>><pagmo::population*>(pagmo::population*&) (interface_iarchive.hpp:60)
> ==28632==    by 0x104676BB: void boost::serialization::nvp<pagmo::population*>::load<boost::mpi::packed_iarchive>(boost::mpi::packed_iarchive&, unsigned int) (nvp.hpp:87)
> ==28632==    by 0x104674AF: void boost::serialization::access::member_load<boost::mpi::packed_iarchive, boost::serialization::nvp<pagmo::population*> >(boost::mpi::packed_iarchive&, boost::serialization::nvp<pagmo::population*>&, unsigned int) (access.hpp:101)
> ==28632==    by 0x104672CF: boost::serialization::detail::member_loader<boost::mpi::packed_iarchive, boost::serialization::nvp<pagmo::population*> >::invoke(boost::mpi::packed_iarchive&, boost::serialization::nvp<pagmo::population*>&, unsigned int) (split_member.hpp:54)
> ==28632==  Address 0x4b65d98 is not stack'd, malloc'd or (recently) free'd
>
> The issue is in the method basic_iarchive_impl::load_pointer,
> around line 450:
>
>     int i = cid;
>     cobject_id_vector[i].bpis_ptr = bpis_ptr;
>
> Indeed, a printf confirms that i == 512 while
> cobject_id_vector.size() == 3. This also provokes the
> assertion new_cid == cid to fail one line below (where
> new_cid == 2). The same code, run locally on the ppc64 acting
> both as slave and master with mpirun -np 2, runs ok. Boost
> version is 1.42.0, MPI implementation is openMPI 1.4.2.
>
> Can this be related to some endianness issue? Is Boost.MPI
> expected to work on heterogeneous clusters?
>
> Thanks,
>
> Francesco.
> _______________________________________________
> Boost-users mailing list
> Boost-users_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/boost-users
Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net