Boost Users :

Subject: Re: [Boost-users] [MPI, serialization] Segmentation fault in heterogeneous cluster
From: Pfligersdorffer, Christian (Christian.Pfligersdorffer_at_[hidden])
Date: 2010-09-02 03:12:50


Hi Francesco!

Binary archives are for use on one single platform only. If you want to
move archives between different platforms, you have to use something
portable, like xml or text archives. I guess x64 and ppc64 have
different endianness, and your compilers might use different sizes
for types such as int as well.
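
To illustrate the byte-order suspicion with the numbers from the quoted
report: a class id of 2 stored as a 16-bit integer becomes 512 when its two
bytes are swapped, which would match the reported i == 512. This is only a
sketch. It assumes the id travels as a raw 16-bit value, which has not been
checked against Boost 1.42 here, and swap_bytes16 and host_is_little_endian
are made-up helper names, not Boost functions.

```cpp
#include <cstdint>
#include <cstring>

// Reverse the two bytes of a 16-bit value. This is the effect of reading
// raw bytes written on a little-endian host on a big-endian host
// (or the other way round) without any conversion.
std::uint16_t swap_bytes16(std::uint16_t v) {
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

// Returns true if the host stores the least significant byte first.
bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);  // look at the byte at the lowest address
    return first == 1;
}
```

Here swap_bytes16(2) yields 512 (0x0002 becomes 0x0200), so a byte-order
mismatch alone could turn a valid index into one far past the end of a
three-element vector.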

You can also have a look at my portable binary archive, which you can
find in the Boost Vault. If you try it, let me know whether it works in
your case.
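
The general idea behind a portable binary format can be sketched roughly as
follows: every integer is written with a fixed width and a fixed byte order,
so the reader never depends on the writer's native layout. This is not the
code of the archive in the vault, just an illustration under that assumption;
put_u32_be and get_u32_be are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 32-bit value as four bytes, most significant byte first,
// independent of the host's native byte order.
void put_u32_be(std::vector<unsigned char>& out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<unsigned char>((v >> shift) & 0xFF));
}

// Read back a 32-bit value written by put_u32_be, starting at offset pos.
// The caller must ensure that pos + 4 <= in.size().
std::uint32_t get_u32_be(const std::vector<unsigned char>& in, std::size_t pos) {
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < 4; ++i)
        v = (v << 8) | in[pos + i];
    return v;
}
```

Because only shifts and masks on the value are used, the same bytes come out
on x86_64 and on ppc64, at the cost of some extra work per value compared
with dumping memory directly.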

Greetings,

--
Christian Pfligersdorffer
Software Engineering
http://www.eos.info
 
boost-users-bounces_at_[hidden] on :
> Hello,
> 
> I'm getting a segfault when using Boost.MPI on a cluster of
> heterogeneous machines (x86_64 and ppc64). The problem arises
> when the "slave" machine, ppc64, receives its payload from
> the "master"
> machine, x86_64, and tries to unpack the archive. Tracing
> down the issue with valgrind and in debug mode, the problem
> arises here:
> 
> ==28632== Invalid write of size 8
> ==28632==    at 0x10429DDC: boost::archive::detail::basic_iarchive_impl::load_pointer(boost::archive::detail::basic_iarchive&, void*&, boost::archive::detail::basic_pointer_iserializer const*, boost::archive::detail::basic_pointer_iserializer const* (*)(boost::serialization::extended_type_info const&)) (basic_iarchive.cpp:453)
> ==28632==    by 0x1042772F: boost::archive::detail::basic_iarchive::load_pointer(void*&, boost::archive::detail::basic_pointer_iserializer const*, boost::archive::detail::basic_pointer_iserializer const* (*)(boost::serialization::extended_type_info const&)) (basic_iarchive.cpp:564)
> ==28632==    by 0x10468707: void boost::archive::detail::load_pointer_type<boost::mpi::packed_iarchive>::invoke<pagmo::population*>(boost::mpi::packed_iarchive&, pagmo::population*&) (iserializer.hpp:518)
> ==28632==    by 0x104683EF: void boost::archive::load<boost::mpi::packed_iarchive, pagmo::population*>(boost::mpi::packed_iarchive&, pagmo::population*&) (iserializer.hpp:586)
> ==28632==    by 0x10468223: void boost::archive::detail::common_iarchive<boost::mpi::packed_iarchive>::load_override<pagmo::population*>(pagmo::population*&, int) (common_iarchive.hpp:68)
> ==28632==    by 0x10468023: void boost::archive::basic_binary_iarchive<boost::mpi::packed_iarchive>::load_override<pagmo::population*>(pagmo::population*&, int) (basic_binary_iarchive.hpp:67)
> ==28632==    by 0x10467E27: void boost::mpi::packed_iarchive::load_override<pagmo::population*>(pagmo::population*&, int, mpl_::bool_<false>) (packed_iarchive.hpp:98)
> ==28632==    by 0x10467C27: void boost::mpi::packed_iarchive::load_override<pagmo::population*>(pagmo::population*&, int) (packed_iarchive.hpp:115)
> ==28632==    by 0x1046798F: boost::mpi::packed_iarchive& boost::archive::detail::interface_iarchive<boost::mpi::packed_iarchive>::operator>><pagmo::population*>(pagmo::population*&) (interface_iarchive.hpp:60)
> ==28632==    by 0x104676BB: void boost::serialization::nvp<pagmo::population*>::load<boost::mpi::packed_iarchive>(boost::mpi::packed_iarchive&, unsigned int) (nvp.hpp:87)
> ==28632==    by 0x104674AF: void boost::serialization::access::member_load<boost::mpi::packed_iarchive, boost::serialization::nvp<pagmo::population*> >(boost::mpi::packed_iarchive&, boost::serialization::nvp<pagmo::population*>&, unsigned int) (access.hpp:101)
> ==28632==    by 0x104672CF: boost::serialization::detail::member_loader<boost::mpi::packed_iarchive, boost::serialization::nvp<pagmo::population*> >::invoke(boost::mpi::packed_iarchive&, boost::serialization::nvp<pagmo::population*>&, unsigned int) (split_member.hpp:54)
> ==28632==  Address 0x4b65d98 is not stack'd, malloc'd or (recently) free'd
> 
> The issue is in the method basic_iarchive_impl::load_pointer,
> around line 450:
> 
> int i = cid;
> cobject_id_vector[i].bpis_ptr = bpis_ptr;
> 
> Indeed, a printf confirms that i == 512 while
> cobject_id_vector.size() == 3. This also causes the
> assertion new_cid == cid to fail one line below (where
> new_cid == 2). The same code, run locally on the ppc64 machine
> acting as both slave and master with mpirun -np 2, runs fine.
> Boost version is 1.42.0, MPI implementation is Open MPI 1.4.2.
> 
> Can this be related to some endianness issue? Is Boost.MPI
> expected to work on heterogeneous clusters?
> 
> Thanks,
> 
>   Francesco.
