[MPI, serialization] Segmentation fault in heterogeneous cluster

Hello,

I'm getting a segfault when using Boost.MPI on a cluster of heterogeneous machines (x86_64 and ppc64). The problem arises when the "slave" machine, ppc64, receives its payload from the "master" machine, x86_64, and tries to unpack the archive. Tracing the issue with valgrind in debug mode, the problem arises here:

==28632== Invalid write of size 8
==28632==    at 0x10429DDC: boost::archive::detail::basic_iarchive_impl::load_pointer(boost::archive::detail::basic_iarchive&, void*&, boost::archive::detail::basic_pointer_iserializer const*, boost::archive::detail::basic_pointer_iserializer const* (*)(boost::serialization::extended_type_info const&)) (basic_iarchive.cpp:453)
==28632==    by 0x1042772F: boost::archive::detail::basic_iarchive::load_pointer(void*&, boost::archive::detail::basic_pointer_iserializer const*, boost::archive::detail::basic_pointer_iserializer const* (*)(boost::serialization::extended_type_info const&)) (basic_iarchive.cpp:564)
==28632==    by 0x10468707: void boost::archive::detail::load_pointer_type<boost::mpi::packed_iarchive>::invoke<pagmo::population*>(boost::mpi::packed_iarchive&, pagmo::population*&) (iserializer.hpp:518)
==28632==    by 0x104683EF: void boost::archive::load<boost::mpi::packed_iarchive, pagmo::population*>(boost::mpi::packed_iarchive&, pagmo::population*&) (iserializer.hpp:586)
==28632==    by 0x10468223: void boost::archive::detail::common_iarchive<boost::mpi::packed_iarchive>::load_override<pagmo::population*>(pagmo::population*&, int) (common_iarchive.hpp:68)
==28632==    by 0x10468023: void boost::archive::basic_binary_iarchive<boost::mpi::packed_iarchive>::load_override<pagmo::population*>(pagmo::population*&, int) (basic_binary_iarchive.hpp:67)
==28632==    by 0x10467E27: void boost::mpi::packed_iarchive::load_override<pagmo::population*>(pagmo::population*&, int, mpl_::bool_<false>) (packed_iarchive.hpp:98)
==28632==    by 0x10467C27: void boost::mpi::packed_iarchive::load_override<pagmo::population*>(pagmo::population*&, int) (packed_iarchive.hpp:115)
==28632==    by 0x1046798F: boost::mpi::packed_iarchive& boost::archive::detail::interface_iarchive<boost::mpi::packed_iarchive>::operator>><pagmo::population*>(pagmo::population*&) (interface_iarchive.hpp:60)
==28632==    by 0x104676BB: void boost::serialization::nvp<pagmo::population*>::load<boost::mpi::packed_iarchive>(boost::mpi::packed_iarchive&, unsigned int) (nvp.hpp:87)
==28632==    by 0x104674AF: void boost::serialization::access::member_load<boost::mpi::packed_iarchive, boost::serialization::nvp<pagmo::population*> >(boost::mpi::packed_iarchive&, boost::serialization::nvp<pagmo::population*>&, unsigned int) (access.hpp:101)
==28632==    by 0x104672CF: boost::serialization::detail::member_loader<boost::mpi::packed_iarchive, boost::serialization::nvp<pagmo::population*> >::invoke(boost::mpi::packed_iarchive&, boost::serialization::nvp<pagmo::population*>&, unsigned int) (split_member.hpp:54)
==28632==  Address 0x4b65d98 is not stack'd, malloc'd or (recently) free'd

The issue is in the method basic_iarchive_impl::load_pointer, around line 450:

int i = cid;
cobject_id_vector[i].bpis_ptr = bpis_ptr;

Indeed, a printf confirms that i == 512 while cobject_id_vector.size() == 3. This also provokes the assertion new_cid == cid to fail one line below (where new_cid == 2). The same code, run locally on the ppc64 acting both as slave and master with mpirun -np 2, runs ok. Boost version is 1.42.0, the MPI implementation is openMPI 1.4.2.

Can this be related to some endianness issue? Is Boost.MPI expected to work on heterogeneous clusters?

Thanks,

Francesco.

On Sep 2, 2010, at 7:39, Francesco Biscani <bluescarni@gmail.com> wrote:
Hello,
I'm getting a segfault when using Boost.MPI on a cluster of heterogeneous machines (x86_64 and ppc64). The problem arises when the "slave" machine, ppc64, receives its payload from the "master" machine, x86_64, and tries to unpack the archive. Tracing down the issue with valgrind and in debug mode, the problem arises here:
Can this be related to some endianness issue? Is Boost.MPI expected to work on heterogeneous clusters?
Hi Francesco,

Have you checked whether a program using the MPI C API can correctly send data on your heterogeneous cluster? Boost.MPI uses the support for heterogeneous machines of the underlying MPI library unless you define the macro BOOST_MPI_HOMOGENOUS.

Have you also tried the latest Boost release?

Matthias
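For readers who want a quick way to run that check, here is a minimal sketch of such a plain MPI C API test (it is not from the original messages; the values and tags are arbitrary). Sending with typed datatypes such as MPI_INT and MPI_DOUBLE leaves any representation conversion between x86_64 and ppc64 to the MPI library:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        // Rank 0 ("master") sends an int and three doubles.
        int n = 42;
        double x[3] = {1.0, -2.5, 3.25};
        MPI_Send(&n, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(x, 3, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Rank 1 ("slave") receives and prints; garbage here would point to a
        // problem in the MPI layer rather than in Boost.MPI.
        int n = 0;
        double x[3] = {0.0, 0.0, 0.0};
        MPI_Recv(&n, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(x, 3, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("received %d and %f %f %f\n", n, x[0], x[1], x[2]);
    }
    MPI_Finalize();
    return 0;
}

Running it with one rank on each architecture and checking the printed values is enough to tell whether the underlying MPI library handles the heterogeneity.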

Hi Matthias,

I updated to Boost 1.44.0 but unfortunately the crash is now even in local mode (mpirun -np 2). The strange thing is that the serialization code is apparently working fine when used with text archives, but with MPI archives the slave process, upon reception, is deserializing the objects with seemingly random values (e.g., huge values instead of 1 or 0 for an integer data member of a structure).

I'm trying to isolate the problem right now and, in case I can reproduce it with a minimal example, I will post it here (though it is likely some mistake on my part, it's the first time I use MPI and serialization libraries).

Cheers,

Francesco

On Thu, Sep 2, 2010 at 4:26 AM, Matthias Troyer <troyer@phys.ethz.ch> wrote:
On Sep 2, 2010, at 7:39, Francesco Biscani <bluescarni@gmail.com> wrote:
Hello,
I'm getting a segfault when using Boost.MPI on a cluster of heterogeneous machines (x86_64 and ppc64). The problem arises when the "slave" machine, ppc64, receives its payload from the "master" machine, x86_64, and tries to unpack the archive. Tracing down the issue with valgrind and in debug mode, the problem arises here:
Can this be related to some endianness issue? Is Boost.MPI expected to work on heterogeneous clusters?
Hi Francesco,
Have you checked whether a program using the MPI C API can correctly send data on your heterogeneous cluster? Boost.MPI uses the support for heterogeneous machines of the underlying MPI library unless you define the macro BOOST_MPI_HOMOGENOUS.
Have you also tried the latest Boost release?
Matthias

On 3 Sep 2010, at 17:31, Francesco Biscani wrote:
Hi Matthias,
I updated to Boost 1.44.0 but unfortunately the crash is now even in local mode (mpirun -np 2). The strange thing is that the serialization code is apparently working fine when used with text archives, but with MPI archives the slave process, upon reception, is deserializing the objects with seemingly random values (e.g., huge values instead of 1 or 0 for an integer data member of a structure).
I'm trying to isolate the problem right now and, in case I can reproduce it with a minimal example, I will post it here (though it is likely some mistake on my part, it's the first time I use MPI and serialization libraries).
Hi Francesco,

Have you tried it with binary archives?

Matthias
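A minimal sketch of such a check (the payload struct below is a stand-in, not a type from this thread): serialize into a binary archive backed by a stringstream and read it back in the same process.

#include <boost/archive/binary_oarchive.hpp>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/serialization/vector.hpp>
#include <cassert>
#include <sstream>
#include <vector>

// Stand-in payload, just to exercise the binary archives.
struct payload {
    std::vector<double> values;
    int r;
    template <class Archive>
    void serialize(Archive &ar, const unsigned int) { ar & values; ar & r; }
};

int main()
{
    payload tmp;
    tmp.values.push_back(1.5);
    tmp.values.push_back(-2.0);
    tmp.r = 7;
    const payload out = tmp; // save a const object, as Boost.Serialization prefers

    std::stringstream ss;
    {
        boost::archive::binary_oarchive oa(ss);
        oa << out;
    }
    payload in;
    {
        boost::archive::binary_iarchive ia(ss);
        ia >> in;
    }
    assert(in.values == out.values && in.r == out.r);
    return 0;
}

If this round trip works while the MPI transfer does not, the problem is more likely in the transport than in the serialize() functions.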

Hi Matthias,

I'm gonna try right now. Just as an update, the problem seems to go away if I serialize the payload in a text archive, convert it to string, and send the string instead of the archive.

Cheers,

Francesco.

On Fri, Sep 3, 2010 at 1:30 PM, Matthias Troyer <troyer@phys.ethz.ch> wrote:
On 3 Sep 2010, at 17:31, Francesco Biscani wrote:
Hi Matthias,
I updated to Boost 1.44.0 but unfortunately the crash is now even in local mode (mpirun -np 2). The strange thing is that the serialization code is apparently working fine when used with text archives, but with MPI archives the slave process, upon reception, is deserializing the objects with seemingly random values (e.g., huge values instead of 1 or 0 for an integer data member of a structure).
I'm trying to isolate the problem right now and, in case I can reproduce it with a minimal example, I will post it here (though it is likely some mistake on my part, it's the first time I use MPI and serialization libraries).
Hi Francesco
Have you tried it with binary archives?
Matthias

Simple serialization into binary archives seems to be ok (no crash and valgrind clean). I slightly changed some implementation details and now the error I get is an "MPI message truncated": sometimes it crashes with this message, other times it seems to hang while eating more and more RAM as time passes.

I'm going to try mpich2 and see if it makes any difference.

Cheers,

Francesco.

On Fri, Sep 3, 2010 at 1:56 PM, Francesco Biscani <bluescarni@gmail.com> wrote:
Hi Matthias,
I'm gonna try right now. Just as an update, the problem seems to go away if I serialize the payload in a text archive, convert it to string, and send the string instead of the archive.
Cheers,
Francesco.
On Fri, Sep 3, 2010 at 1:30 PM, Matthias Troyer <troyer@phys.ethz.ch> wrote:
On 3 Sep 2010, at 17:31, Francesco Biscani wrote:
Hi Matthias,
I updated to Boost 1.44.0 but unfortunately the crash is now even in local mode (mpirun -np 2). The strange thing is that the serialization code is apparently working fine when used with text archives, but with MPI archives the slave process, upon reception, is deserializing the objects with seemingly random values (e.g., huge values instead of 1 or 0 for an integer data member of a structure).
I'm trying to isolate the problem right now and, in case I can reproduce it with a minimal example, I will post it here (though it is likely some mistake on my part, it's the first time I use MPI and serialization libraries).
Hi Francesco
Have you tried it with binary archives?
Matthias

On 3 Sep 2010, at 20:55, Francesco Biscani wrote:
Simple serialization into binary archives seems to be ok (no crash and valgrind clean). I changed slightly some implementation details and now the error I get is an "MPI message truncated": sometimes it crashes with this message, other times it seems to hang while eating more and more RAM as time passes.
I'm going to try mpich2 and see if it makes any difference.
Cheers,
Francesco.
What I'll need is a test case that shows the bug - otherwise I cannot help.

Matthias

Hi Matthias,

probably I'm doing something really stupid, but it seems the problem is somehow related to shared_ptr. This code reproduces the "MPI message truncated" error:

#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <boost/serialization/assume_abstract.hpp>
#include <boost/serialization/export.hpp>
#include <boost/serialization/base_object.hpp>
#include <boost/serialization/shared_ptr.hpp>
#include <boost/serialization/tracking.hpp>
#include <boost/serialization/vector.hpp>
#include <boost/shared_ptr.hpp>
#include <vector>

struct base {
    virtual void do_something() const = 0;
    template <class Archive>
    void serialize(Archive &ar, const unsigned int)
    {
        ar & values;
    }
    std::vector<double> values;
    virtual ~base() {}
};

BOOST_SERIALIZATION_ASSUME_ABSTRACT(base);

struct derived: public base {
    void do_something() const {}
    template <class Archive>
    void serialize(Archive &ar, const unsigned int)
    {
        ar & boost::serialization::base_object<base>(*this);
    }
};

BOOST_CLASS_EXPORT(derived);

struct container {
    template <class Archive>
    void serialize(Archive &ar, const unsigned int)
    {
        ar & ptr;
    }
    boost::shared_ptr<base> ptr;
};

int main()
{
    boost::mpi::environment env;
    boost::mpi::communicator world;
    if (world.rank() == 0) {
        boost::shared_ptr<container> c(new container());
        world.send(1,0,c);
        world.recv(1,0,c);
    } else {
        boost::shared_ptr<container> c(new container());
        world.recv(0,0,c);
        world.send(0,0,c);
    }
    return 0;
}

The error happens when rank 1 is receiving the object:

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::mpi::exception> >'
  what():  MPI_Unpack: MPI_ERR_TRUNCATE: message truncated

Thanks,
Francesco.

Just an update, in case anyone is still following this.

It turns out that even when serializing the classes to a text archive, converting it to a string, transmitting the string via boost::mpi and then rebuilding the classes on the other side from the transmitted string, I still have the same error as reported above for heterogeneous clusters (on homogeneous clusters it works seemingly ok).

So what I'm doing now is to send the archive in string form using the MPI_* primitives directly (with a std::vector<char> as buffer and the MPI_CHAR datatype). This works in all configurations I've tested (a sketch of this approach follows after the quoted message below).

I'm not entirely sure if the problem is on my side or if this is a genuine bug, but I would like to provide any info/testing necessary to solve this issue.

Thanks again,

Francesco.

On Fri, Sep 3, 2010 at 6:36 PM, Francesco Biscani <bluescarni@gmail.com> wrote:
Hi Matthias,
probably I'm doing something really stupid, but it seems the problem is somehow related to shared_ptr. This code reproduces the "MPI message truncated error":
#include <boost/mpi/environment.hpp> #include <boost/mpi/communicator.hpp> #include <boost/serialization/assume_abstract.hpp> #include <boost/serialization/export.hpp> #include <boost/serialization/base_object.hpp> #include <boost/serialization/shared_ptr.hpp> #include <boost/serialization/tracking.hpp> #include <boost/serialization/vector.hpp> #include <boost/shared_ptr.hpp> #include <vector>
struct base { virtual void do_something() const = 0; template <class Archive> void serialize(Archive &ar, const unsigned int) { ar & values; } std::vector<double> values; virtual ~base() {} };
BOOST_SERIALIZATION_ASSUME_ABSTRACT(base);
struct derived: public base { void do_something() const {}; template <class Archive> void serialize(Archive &ar, const unsigned int) { ar & boost::serialization::base_object<base>(*this); } };
BOOST_CLASS_EXPORT(derived);
struct container { template <class Archive> void serialize(Archive &ar, const unsigned int) { ar & ptr; } boost::shared_ptr<base> ptr; };
int main() { boost::mpi::environment env; boost::mpi::communicator world; if (world.rank() == 0) { boost::shared_ptr<container> c(new container()); world.send(1,0,c); world.recv(1,0,c); } else { boost::shared_ptr<container> c(new container()); world.recv(0,0,c); world.send(0,0,c); } return 0; }
The error happens when rank 1 is receiving the object:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::mpi::exception> >'
  what():  MPI_Unpack: MPI_ERR_TRUNCATE: message truncated
Thanks,
Francesco.
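Here is a rough sketch of the string-over-MPI_CHAR workaround described in the update above (the Payload type, tags and helper names are placeholders, not code from the thread). Sending the length first keeps the receive buffer exact, and constructing the std::string with an explicit length tolerates embedded '\0' bytes:

#include <boost/archive/text_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>
#include <mpi.h>
#include <sstream>
#include <string>
#include <vector>

// A boost::mpi::communicator converts implicitly to MPI_Comm, so 'comm' can
// simply be the communicator used elsewhere in the program.
template <class Payload>
void send_as_text(const Payload &p, int dest, MPI_Comm comm)
{
    std::ostringstream oss;
    {
        boost::archive::text_oarchive oa(oss);
        oa << p;
    }
    std::string s = oss.str();                 // archive as a flat byte string
    int size = static_cast<int>(s.size());
    MPI_Send(&size, 1, MPI_INT, dest, 0, comm);
    MPI_Send(const_cast<char *>(s.data()), size, MPI_CHAR, dest, 1, comm);
}

template <class Payload>
void recv_as_text(Payload &p, int source, MPI_Comm comm)
{
    int size = 0;
    MPI_Recv(&size, 1, MPI_INT, source, 0, comm, MPI_STATUS_IGNORE);
    std::vector<char> buf(size);               // a text archive is never empty
    MPI_Recv(&buf[0], size, MPI_CHAR, source, 1, comm, MPI_STATUS_IGNORE);
    std::string s(buf.begin(), buf.end());     // explicit length, no '\0' issues
    std::istringstream iss(s);
    boost::archive::text_iarchive ia(iss);
    ia >> p;
}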

On 7 Sep 2010, at 22:41, Francesco Biscani wrote:
Just an update, in case anyone is still following this.
It turns out that even when serializing the classes to a text archive, converting it to string, transmit the string via boost::mpi and then rebuilding the classes on the other side from the transmitted string, I still have the same error as reported above for heterogeneous clusters (in homogeneous clusters it works seemingly ok).
So what I'm doing now is to send the archive in string form using directly the MPI_* primitives (using a std::vector<char> as buffer and MPI_CHAR datatype). This works in all configurations I've tested.
I'm not entirely sure if the problem is on my side or if this is a genuine bug, but I would like to provide any info/testing necessary to solve this issue.
Thanks again,
Can you just send me a program that exhibits the problem?

Also, did you test whether your MPI library works on the heterogeneous machine when making the MPI_* calls and packing data into a buffer using the MPI_Pack/MPI_Unpack calls? There might be a problem with pack/unpack on your system.

Matthias
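For reference, a minimal sketch of such an MPI_Pack/MPI_Unpack test (again not from the thread; the values and tags are arbitrary). Rank 0 packs an int and three doubles, ships the buffer as MPI_PACKED, and rank 1 unpacks it; on a heterogeneous pair the MPI library has to perform any representation conversion for the packed data:

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        int n = 3;
        double x[3] = {1.0, -2.5, 3.25};
        int size_int = 0, size_dbl = 0;
        MPI_Pack_size(1, MPI_INT, MPI_COMM_WORLD, &size_int);
        MPI_Pack_size(3, MPI_DOUBLE, MPI_COMM_WORLD, &size_dbl);
        std::vector<char> buf(size_int + size_dbl);
        int pos = 0;
        MPI_Pack(&n, 1, MPI_INT, &buf[0], (int)buf.size(), &pos, MPI_COMM_WORLD);
        MPI_Pack(x, 3, MPI_DOUBLE, &buf[0], (int)buf.size(), &pos, MPI_COMM_WORLD);
        MPI_Send(&pos, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          // packed size
        MPI_Send(&buf[0], pos, MPI_PACKED, 1, 1, MPI_COMM_WORLD);  // packed data
    } else if (rank == 1) {
        int packed_size = 0;
        MPI_Recv(&packed_size, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::vector<char> buf(packed_size);
        MPI_Recv(&buf[0], packed_size, MPI_PACKED, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        int pos = 0, n = 0;
        double x[3] = {0.0, 0.0, 0.0};
        MPI_Unpack(&buf[0], packed_size, &pos, &n, 1, MPI_INT, MPI_COMM_WORLD);
        MPI_Unpack(&buf[0], packed_size, &pos, x, 3, MPI_DOUBLE, MPI_COMM_WORLD);
        std::printf("unpacked %d and %f %f %f\n", n, x[0], x[1], x[2]);
    }
    MPI_Finalize();
    return 0;
}

This exercises the same pack/unpack machinery that Boost.MPI's packed archives use underneath.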

Hi Matthias,

On Wed, Sep 8, 2010 at 4:42 AM, Matthias Troyer <troyer@phys.ethz.ch> wrote:
Can you just send me a program that exhibits the problem?
I can reproduce the error with the minimal program attached in an earlier message from this thread. Otherwise, the real code is from this GIT repository in the branch called "mpi": http://pagmo.git.sourceforge.net/git/gitweb.cgi?p=pagmo/pagmo;a=summary The relevant code is in src/mpi_environment.cpp and mpi_island.cpp. The first file implements a class that inits a boost::mpi::environment and, in case of slave nodes, opens up a "daemon" waiting for jobs to execute. The class in the second file is in charge of sending the jobs from the master node to the slaves.
Also, did you test whether your MPI library works on the heterogeneous machine when making the MPI_* calls and packing data into a buffer using the MPI_Pack/MPI_Unpack calls? There might be a problem with pack/unpack on your system.
Well, the problem is there also in a homogeneous configuration, both in local and remote execution. I have never used those MPI calls before, but I tried different setups (e.g., openMPI vs MPICH2, gentoo vs ubuntu, x86 vs ppc64, gcc 4.4 vs 4.5) and all have the same problem. Valgrind comes out completely clean too :/

I'll see if I can get the hang of the MPI (un)pack calls.

Cheers,

Francesco.

Hi,

On 3 Sep 2010, at 18:36, Francesco Biscani wrote:
Hi Matthias,
probably I'm doing something really stupid, but it seems the problem is somehow related to shared_ptr. This code reproduces the "MPI message truncated error":
#include <boost/mpi/environment.hpp> #include <boost/mpi/communicator.hpp> #include <boost/serialization/assume_abstract.hpp> #include <boost/serialization/export.hpp> #include <boost/serialization/base_object.hpp> #include <boost/serialization/shared_ptr.hpp> #include <boost/serialization/tracking.hpp> #include <boost/serialization/vector.hpp> #include <boost/shared_ptr.hpp> #include <vector>
struct base { virtual void do_something() const = 0; template <class Archive> void serialize(Archive &ar, const unsigned int) { ar & values; } std::vector<double> values; virtual ~base() {} };
BOOST_SERIALIZATION_ASSUME_ABSTRACT(base);
struct derived: public base { void do_something() const {}; template <class Archive> void serialize(Archive &ar, const unsigned int) { ar & boost::serialization::base_object<base>(*this); } };
BOOST_CLASS_EXPORT(derived);
struct container { template <class Archive> void serialize(Archive &ar, const unsigned int) { ar & ptr; } boost::shared_ptr<base> ptr; };
int main() { boost::mpi::environment env; boost::mpi::communicator world; if (world.rank() == 0) { boost::shared_ptr<container> c(new container()); world.send(1,0,c); world.recv(1,0,c); } else { boost::shared_ptr<container> c(new container()); world.recv(0,0,c); world.send(0,0,c); } return 0; }
The error happens when rank 1 is receiving the object:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::mpi::exception> >'
  what():  MPI_Unpack: MPI_ERR_TRUNCATE: message truncated
Hi Francesco,

The support for shared_ptr was incomplete. Can you try this example with the current SVN trunk?

Matthias

Hi Matthias,

thanks for looking into this. I'm going to test the trunk (probably during the weekend, as I'm at work busy with other matters right now) and report back here.

Cheers,

Francesco

On Thu, Sep 23, 2010 at 9:41 PM, Matthias Troyer <troyer@phys.ethz.ch> wrote:
Hi On 3 Sep 2010, at 18:36, Francesco Biscani wrote:
Hi Matthias,
probably I'm doing something really stupid, but it seems the problem is somehow related to shared_ptr. This code reproduces the "MPI message truncated error":
#include <boost/mpi/environment.hpp> #include <boost/mpi/communicator.hpp> #include <boost/serialization/assume_abstract.hpp> #include <boost/serialization/export.hpp> #include <boost/serialization/base_object.hpp> #include <boost/serialization/shared_ptr.hpp> #include <boost/serialization/tracking.hpp> #include <boost/serialization/vector.hpp> #include <boost/shared_ptr.hpp> #include <vector>
struct base { virtual void do_something() const = 0; template <class Archive> void serialize(Archive &ar, const unsigned int) { ar & values; } std::vector<double> values; virtual ~base() {} };
BOOST_SERIALIZATION_ASSUME_ABSTRACT(base);
struct derived: public base { void do_something() const {}; template <class Archive> void serialize(Archive &ar, const unsigned int) { ar & boost::serialization::base_object<base>(*this); } };
BOOST_CLASS_EXPORT(derived);
struct container { template <class Archive> void serialize(Archive &ar, const unsigned int) { ar & ptr; } boost::shared_ptr<base> ptr; };
int main() { boost::mpi::environment env; boost::mpi::communicator world; if (world.rank() == 0) { boost::shared_ptr<container> c(new container()); world.send(1,0,c); world.recv(1,0,c); } else { boost::shared_ptr<container> c(new container()); world.recv(0,0,c); world.send(0,0,c); } return 0; }
The error happens when rank 1 is receiving the object:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::mpi::exception> >'
  what():  MPI_Unpack: MPI_ERR_TRUNCATE: message truncated
Hi Francesco,
The support for shared_ptr was incomplete. Can you try this example with the current SVN trunk?
Matthias

Hi Francesco,
The support for shared_ptr was incomplete. Can you try this example with the current SVN trunk?
Matthias
Matthias,

I tried this example using revision and I get this error:

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::mpi::exception> >'
  what():  MPI_Unpack: MPI_ERR_TRUNCATE: message truncated
[...]

I tested this because I am trying to do something very similar, but it also does not work. See my code example here: http://gist.github.com/613446 (see the error message here: http://gist.github.com/613479)

Am I messing up here, or is this a problem with Boost MPI, possibly in how it interacts with Boost Serialization?

Also I had a version of this code where shared_ptr was replaced by a regular pointer, which also did not work [0]. Is it a bad idea in general to use regular pointers when sending stuff through MPI using serialization?

Best,
Sebastian

[0]: https://gist.github.com/613446/afba16ea4711f8674060ffcdbb41e85188d572c7

On 7 Oct 2010, at 04:37, Sebastian Schaetz wrote:
Hi Francesco,
The support for shared_ptr was incomplete. Can you try this example with the current SVN trunk?
Matthias
Matthias, I tried this example using revision and I get this error:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl <boost::exception_detail::error_info_injector<boost::mpi::exception>>' what(): MPI_Unpack: MPI_ERR_TRUNCATE: message truncated [...]
I tested this because I try to do something very similar but it also does not work. See my code example here: http://gist.github.com/613446 (see error message here http://gist.github.com/613479)
Am I messing up here or is this a problem with Boost MPI possibly in how it interacts with Boost Serialization?
Also I had a version of this code where shared_ptr was replaced by a regular pointer which also did not work [0]. Is it a bad idea in general to use a regular pointers when sending stuff through MPI using serialization?
Hi Sebastian,

Thank you for posting a complete example that exhibits the problem! I could further simplify it and then solve it - the problem was once more an undocumented requirement of the latest version of the Boost.Serialization library. I applied workarounds in Boost.MPI and now both the raw pointer and shared pointer cases work on my machine. I have also added a new regression test based on a simplified version of your example.

It is fixed on the trunk now and should make it into 1.45.

Matthias

Hello,

I have a similar problem here. I try to send data from one process to another (mpirun -np 2). The data I use is serialized in the appropriate way. If I write it to a text archive I can restore it again from that text archive and all is ok. But when I try to send the data between the processes something goes wrong and the data is not restored correctly.

I have the following serialization routine:

template<typename Float>
template<class Archive>
void Ball<Float>::serialize( Archive &ar, const unsigned int version )
{
    ar & BOOST_SERIALIZATION_NVP( members ); // std::vector<int> members
    // and so on...

    // Testing:
    for( int i=0; i<members.size(); ++i )
        std::cout << members[i] << " ";
    std::cout << std::endl;
}

the output on sending such an object is (for example)

0 1 3 2

the output on receiving the same object is

0 0 0 0

So there seems to be something wrong during storing the data in ar or during the restoration process.

Did someone in this thread already come to a solution to this problem?

Cheers,
Martin

On 03.09.2010 11:31, Francesco Biscani wrote:
Hi Matthias,
I updated to Boost 1.44.0 but unfortunately the crash is now even in local mode (mpirun -np 2). The strange thing is that the serialization code is apparently working fine when used with text archives, but with MPI archives the slave process, upon reception, is deserializing the objects with seemingly random values (e.g., huge values instead of 1 or 0 for an integer data member of a structure).
I'm trying to isolate the problem right now and, in case I can reproduce it with a minimal example, I will post it here (though it is likely some mistake on my part, it's the first time I use MPI and serialization libraries).
Cheers,
Francesco
On Thu, Sep 2, 2010 at 4:26 AM, Matthias Troyer<troyer@phys.ethz.ch> wrote:
On Sep 2, 2010, at 7:39, Francesco Biscani<bluescarni@gmail.com> wrote:
Hello,
I'm getting a segfault when using Boost.MPI on a cluster of heterogeneous machines (x86_64 and ppc64). The problem arises when the "slave" machine, ppc64, receives its payload from the "master" machine, x86_64, and tries to unpack the archive. Tracing down the issue with valgrind and in debug mode, the problem arises here:
Can this be related to some endianness issue? Is Boost.MPI expected to work on heterogeneous clusters?
Hi Francesco,
Have you checked whether a program using the MPI C API can correctly send data on your heterogeneous cluster? Boost.MPI uses the support for heterogeneous machines of the underlying MPI library unless you define the macro BOOST_MPI_HOMOGENOUS.
Have you also tried the latest Boost release?
Matthias

On Sep 16, 2010, at 1:21 PM, Martin Hünniger wrote:
Hello,
I have a similar problem here. I try to send data from one process to another (mpirun -np 2). The dataI use is serialized in the appropriate way. If I send it to a text archive I can it restore again from this text archive and all is ok. But when I try to send the data between the processes something goes wrong and the data is not restored correctly.
I have the following serialization routine:
template<typename Float> template<class Archive> void Ball<Float>::serialize( Archive &ar, const unsigned int version ) { ar & BOOST_SERIALIZATION_NVP( members ) // std::vector<int> members // and so on...
// Testing: for( int i=0; i<members.size(); ++i ) std::cout << members[i] << " "; std::cout << std::endl; }
the output on sending such an object is (for example) 0 1 3 2
the output on receiving the same object is 0 0 0 0
So there seems to be something wrong during storing the data in ar or during the restoration process.
Did someone in this thread come already to a solution to this problem?
Cheers, Martin
I'll take a look this weekend. I assume the "// and so on.." does not do any further serialization?

Matthias

Hi Matthias,

yes, the "// and so on" does further serializations. Here is the code; there are some redundancies for later design decisions:

void Ball<Float>::serialize( Archive & ar, const unsigned int version )
{
    ar & BOOST_SERIALIZATION_NVP( membership );  // vector<bool>
    ar & BOOST_SERIALIZATION_NVP( members );     // vector<int>
    ar & BOOST_SERIALIZATION_NVP( r );           // int
    ar & BOOST_SERIALIZATION_NVP( is_infinity ); // bool
    ar & BOOST_SERIALIZATION_NVP( up_to_date );  // bool
    ar & BOOST_SERIALIZATION_NVP( *QR );         // own type, see below
    //if( !up_to_date )
    //    update();
    for( int i=0; i<members.size(); ++i ) {
        std::cout << members[i] << " ";
    }
    std::cout << std::endl;
    for( int i=0; i<membership.size(); ++i ) {
        std::cout << membership[i] << " ";
    }
    std::cout << std::endl;
    std::cout << r << std::endl;
    std::cout << is_infinity << std::endl;
}

void Subspan<Float>::serialize( Archive & ar, const unsigned int version )
{
    ar & BOOST_SERIALIZATION_NVP( membership ); // vector<bool>
    ar & BOOST_SERIALIZATION_NVP( members );    // vector<int>
    for( int i=0; i<dim; ++i )
        for( int j=0; j<dim; ++j )
            ar & BOOST_SERIALIZATION_NVP( Q[i][j] ); // double
    for( int i=0; i<dim; ++i )
        for( int j=0; j<dim; ++j )
            ar & BOOST_SERIALIZATION_NVP( R[i][j] ); // double
    for( int i=0; i<dim; ++i )
        ar & BOOST_SERIALIZATION_NVP( x[i] ); // double
    for( int i=0; i<dim; ++i )
        ar & BOOST_SERIALIZATION_NVP( d[i] ); // double
    ar & BOOST_SERIALIZATION_NVP( r ); // int
}

I don't think the problem lies in my serialization routine, because the output of the serialized stuff looks right. In fact I think that the serialization is broken in some way, because if I transmit the serialized object through a text_archive between 2 processes, the data gets corrupted sometimes. Example:

Master sends the data to worker 1
22 serialization::archive 7 0 0 10 1 0 0 0 1 0 0 0 1 0 4 0 4 8 0 1 2 0 1 0 0 10 1 0 0 0 1 0 0 0 1 0 4 0 4 8 0 1 -0.0057421862890573716 0.34419941995529357 -0.93887900530316526 0.026836519283445603 -0.93850327402166567 -0.34422580653310031 -0.99962334332956182 -0.027172853237254602 -0.0038480537386582914 1.3412866491336766 0 0 0.25571772247364938 0.83569200171657654 0 -1.3694313541089012 1.1400579617129889 -0.7062870570269757 -0.68908851940039961 -0.014041993697918429 -0.0017429988348310888 0.95929305277075005 0.17642171085439082 0.22052713856798853 2

Worker 1 is receiving job from Master
22 serialization::archive 7 0 0 10 1 0 0 0 1 0 0 0 1 0 4 0 4 8 0 1 2 0 1 0 0 10 1 0 0 0 1 0 0 0 1 0 4 0 4 8 0 1 -0.0057421862890573716 0.34419941995529357 -0.93887900530316526 0.026836519283445603 -0.93850327402166567 -0.34422580653310031 -0.99962334332956182 -0.027172853237254602 -0.0038480537386582914 1.3412866491336766 0 0 0.25571772247364938 0.83569200171657654 0 -1.3694313541089012 1.1400579617129889 -0.7062870570269757 -0.68908851940039961 -0.014041993697918429 -0.0017429988348310888 0.95929305277075005 0.17642171085439082 0.220527138567988P

Maybe you'll notice the "P" instead of the "53 2" at the end of the transmission. I have no clue how this can happen. But when it happens it's only long after my program started, and the same piece of code that generates this output has been called for like 50 times.

If I use a binary_archive the program terminates with the error:

Master sends size of data to worker 1
Master sends the data to worker 1
serialization::archive G?<??Ô????M?????????????)???k* x???|?n???x???A"????r?H??<?:??t_??!???*?????X?0?????????J?W??V?M???N?z???f<???f<????g?P4?
Worker 1 is receiving job from Master
terminate called after throwing an instance of 'boost::archive::archive_exception'
  what():  invalid signature
[ipc858:10286] *** Process received signal ***
[ipc858:10286] Signal: Aborted (6)
[ipc858:10286] Signal code: (-6)
[ipc858:10286] [ 0] /lib/libpthread.so.0 [0x7f0d10392a80]
[ipc858:10286] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7f0d10062ed5]
[ipc858:10286] [ 2] /lib/libc.so.6(abort+0x183) [0x7f0d100643f3]
[ipc858:10286] [ 3] /usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x114) [0x7f0d10b02294]
[ipc858:10286] [ 4] /usr/lib/libstdc++.so.6 [0x7f0d10b00696]
[ipc858:10286] [ 5] /usr/lib/libstdc++.so.6 [0x7f0d10b006c3]
[ipc858:10286] [ 6] /usr/lib/libstdc++.so.6 [0x7f0d10b007aa]
[ipc858:10286] [ 7] /home/pirx/local/lib/libboost_serialization.so.1.44.0(_ZN5boost7archive21basic_binary_iarchiveINS0_15binary_iarchiveEE4initEv+0x166) [0x7f0d11d1c726]
[ipc858:10286] [ 8] ./my_complex(_ZN2FC6WorkerINS_4BallIdEES2_E8get_workERS2_+0x227) [0x441677]
[ipc858:10286] [ 9] ./my_complex(_ZN2FC12My_complexIdE13working_horseERSo+0x1a4) [0x442c24]
[ipc858:10286] [10] ./my_complex(_ZN2FC12Flow_complexIdE7computeERSo+0x87) [0x445097]
[ipc858:10286] [11] ./my_complex(main+0xc44) [0x42c9f4]
[ipc858:10286] [12] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f0d1004f1a6]
[ipc858:10286] [13] ./my_complex(__gxx_personality_v0+0x191) [0x42b6a9]
[ipc858:10286] *** End of error message ***
mpirun noticed that job rank 0 with PID 10285 on node ipc858 exited on signal 15 (Terminated).

Regards,
Martin

Matthias Troyer wrote:
On Sep 16, 2010, at 1:21 PM, Martin Hünniger wrote:
Hello,
I have a similar problem here. I try to send data from one process to another (mpirun -np 2). The dataI use is serialized in the appropriate way. If I send it to a text archive I can it restore again from this text archive and all is ok. But when I try to send the data between the processes something goes wrong and the data is not restored correctly.
I have the following serialization routine:
template<typename Float> template<class Archive> void Ball<Float>::serialize( Archive &ar, const unsigned int version ) { ar & BOOST_SERIALIZATION_NVP( members ) // std::vector<int> members // and so on...
// Testing: for( int i=0; i<members.size(); ++i ) std::cout << members[i] << " "; std::cout << std::endl; }
the output on sending such an object is (for example) 0 1 3 2
the output on receiving the same object is 0 0 0 0
So there seems to be something wrong during storing the data in ar or during the restoration process.
Did someone in this thread come already to a solution to this problem?
Cheers, Martin
I'll take a look this weekend. I assume the "// and so on.." does not do any further serialization?
Matthias

On 20 Sep 2010, at 17:49, Martin Huenniger wrote:
Hi Matthias,
yes //and so on does further serializations. Here is the code, there a some redundancies for later design decisions:
Could you please send a stripped down but complete example code that exhibits your problems? That way it will be easiest to find what is going on.

Matthias

Hi,

so here are the pieces of code that cause the problem.

The following routine is the main loop of the master process. It dispatches the data to and from the worker processes:

template<typename Float>
void My_complex<Float>::dispatch()
{
    FC_ASSERT( comm.rank() == 0 );
    Master<Ball<Float>, Ball<Float> > master_process( comm );
    while( !points_to_explore->empty() || master_process.some_working() ) {
        int command, w_rank;
        Ball<Float> *the_ball;
        Ball<Float> result( dim, S, S[0], 0 );
        if( w_rank = master_process.listen( command ) ) {
            switch( command ) {
            case MW::ask_for_job:
                if( !points_to_explore->empty() ) {
                    the_ball = points_to_explore->top();
                    points_to_explore->pop();
                    // Here is the place where the error occurs:
                    master_process.send_work( w_rank, *the_ball );
                    delete the_ball;
                } else {
                    master_process.worker_try_again( w_rank );
                }
                break;
            case MW::return_result:
                // In this piece of code the problem also appears, but the
                // code is essentially the same with the roles of master and
                // worker exchanged...
                master_process.get_result( w_rank, result );
                the_ball = new Ball<Float>( result );
                enqueue( the_ball ); // enqueues in points_to_explore
                break;
            case MW::job_done:
                master_process.free_worker( w_rank );
                break;
            }
        }
        sleep(1);
    }
    master_process.suspend_all_workers();
    return;
}

At the very same time the worker processes run the next routine. They receive a ball from the master and explore it. During exploration newly found balls are sent back to the master.

template<typename Float>
void My_complex<Float>::working_horse( std::ostream &s )
{
    FC_ASSERT( comm.rank() != 0 );
    worker_process = new Worker_type( comm );
    Ball<Float> data( dim, S, S[0], 0 );
    // here the data is received from the master process:
    while( worker_process->get_work( data ) ) {
        Ball<Float> *the_ball = new Ball<Float>( data );
        explore_cell( the_ball, the_ball, s );
        worker_process->done();
    }
    delete worker_process;
}

Communication from master to worker is implemented in the next snippet. That is the code that generates the output provided in the last mail. The idea is the following: since the boost::mpi::communicator::send() and boost::mpi::communicator::recv() routines refuse to work with my serialization, I implemented the communication using the C bindings of MPI. So the data gets serialized into a text_[io]archive over a std::stringstream (binary_[io]archive isn't working), and the string extracted from this stringstream is sent over the communication channel. On receiving it gets corrupted sometimes.

#include <boost/mpi.hpp>
#include <string>
#include <sstream>
#include <boost/serialization/string.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>
#include <mpi.h>
#include <cstdlib>

typedef boost::archive::text_oarchive Oarchive;
typedef boost::archive::text_iarchive Iarchive;

template<typename Jobtype, typename Restype>
void Master<Jobtype,Restype>::send_work( const int w, const Jobtype &data )
{
    FC_DEBUG_OUTPUT( << "Master sends data to worker " << w << "\n" );
    /*
     * comm.send( w, MW::send_job_data_tag, data );
     */
    {
        std::stringstream ss;
        {
            Oarchive oa(ss);
            oa << data;
        }
        int size = ss.str().size();
        comm.send( w, MW::send_job_data_tag );
        FC_DEBUG_OUTPUT( << "Master sends size of data to worker " << w << "\n" );
        MPI_Send( &size, 1, MPI_INT, w, MW::send_job_data_size, MPI_Comm(comm) );
        FC_DEBUG_OUTPUT( << "Master sends the data to worker " << w << "\n" );
        FC_DEBUG_OUTPUT( << ss.str() << "\n" );
        char *buf = const_cast<char*>( ss.str().c_str() );
        MPI_Send( buf, size, MPI_CHAR, w, MW::send_job_data_tag, MPI_Comm(comm) );
    }
    if( working[w] == MW::worker_free ) {
        working[w] = MW::worker_working;
        ++num_working_workers;
    }
}

template<typename Jobtype, typename Restype>
bool Worker<Jobtype,Restype>::get_work( Jobtype &data )
{
    FC_DEBUG_OUTPUT( << "Worker process " << comm.rank() << " is trying to get some work.\n" );
    while( true ) {
        int ask_for_job = MW::ask_for_job;
        comm.send( 0, MW::listen_tag, ask_for_job );
        mpi::status status = comm.probe( 0, mpi::any_tag );
        if( status.tag() == MW::try_again_tag ) {
            FC_DEBUG_OUTPUT( << "Worker " << comm.rank() << " keeps on waiting...\n" );
            sleep(1); //usleep(1);
            continue;
        }
        if( status.tag() == MW::suspend_tag ) {
            FC_DEBUG_OUTPUT( << "Worker " << comm.rank() << " is suspended.\n" );
            comm.recv( 0, MW::suspend_tag );
            return false;
        }
        /*
         * comm.recv( 0, MW::send_job_data_tag, data );
         */
        if( status.tag() == MW::send_job_data_tag ) {
            comm.recv( 0, MW::send_job_data_tag );
            FC_DEBUG_OUTPUT( << "Worker " << comm.rank() << " is receiving job from Master\n" );
            int size;
            MPI_Status mstatus;
            MPI_Recv( &size, 1, MPI_INT, 0, MW::send_job_data_size, MPI_Comm(comm), &mstatus );
            char *buf = static_cast<char*>( malloc( size ) );
            MPI_Recv( buf, size, MPI_CHAR, 0, MW::send_job_data_tag, MPI_Comm(comm), &mstatus );
            std::string s( buf );
            std::stringstream ss( s );
            FC_DEBUG_OUTPUT( << ss.str() << "\n" );
            {
                Iarchive ia( ss );
                ia >> data;
            }
            free( buf );
            FC_DEBUG_OUTPUT( << "Worker " << comm.rank() << " received job.\n" );
            return true;
        }
    }
}

Once again the serialization:

void Ball<Float>::serialize( Archive & ar, const unsigned int version )
{
    ar & BOOST_SERIALIZATION_NVP( membership );  // vector<bool>
    ar & BOOST_SERIALIZATION_NVP( members );     // vector<int>
    ar & BOOST_SERIALIZATION_NVP( r );           // int
    ar & BOOST_SERIALIZATION_NVP( is_infinity ); // bool
    ar & BOOST_SERIALIZATION_NVP( up_to_date );  // bool
    ar & BOOST_SERIALIZATION_NVP( *QR );         // own type, see below
}

void Subspan<Float>::serialize( Archive & ar, const unsigned int version )
{
    ar & BOOST_SERIALIZATION_NVP( membership ); // vector<bool>
    ar & BOOST_SERIALIZATION_NVP( members );    // vector<int>
    for( int i=0; i<dim; ++i )
        for( int j=0; j<dim; ++j )
            ar & BOOST_SERIALIZATION_NVP( Q[i][j] ); // double
    for( int i=0; i<dim; ++i )
        for( int j=0; j<dim; ++j )
            ar & BOOST_SERIALIZATION_NVP( R[i][j] ); // double
    for( int i=0; i<dim; ++i )
        ar & BOOST_SERIALIZATION_NVP( x[i] ); // double
    for( int i=0; i<dim; ++i )
        ar & BOOST_SERIALIZATION_NVP( d[i] ); // double
    ar & BOOST_SERIALIZATION_NVP( r ); // int
}

The template parameters Jobtype and Restype are instantiated to Ball<double>.

I read elsewhere that using text_archives with doubles is generally a bad idea since receiving NaN's and inf's is not possible. As the output in the last mail shows, this was not the case. Maybe you have an idea what is causing the trouble with the binary_archive, because using that would seem a lot safer to me.

All the best,
Martin

Matthias Troyer wrote:
On 20 Sep 2010, at 17:49, Martin Huenniger wrote:
Hi Matthias,
yes //and so on does further serializations. Here is the code, there a some redundancies for later design decisions:
Could you please send a stripped down but complete example code that exhibits your problems? That way it will be easiest to find what is going on
Matthias

On 21 Sep 2010, at 10:59, Martin Huenniger wrote:
Hi,
so here are the pieces of code that cause the problem:
The following routine is the main loop of the master process. It dispatches the data to and from the worker processes:
'''
At the very same time the worker processes run the next routine. They receive a ball from the master and explore it. During exploration new found balls are sent back to the master.
...
Communication from master to worker is implemented in the next snippet. That is the code that generates the output provided in the last mail.
The idea is the following: Since the boost::mpi::communicator::send() and boost::mpi::communicator::recv() routines refuse to work with my serialization, I implemented the communication using the C Bindings of MPI. So the data gets serialized in some text_[io]archive over a std::stringstream (binary_[io]archive isn't working), and the string extracted from this stringstream is sent over the communication channel. On receiving it gets corrupted sometimes. ...
The template-parameters Jobtype and Restype are instatiated to Ball<double>.
I read elsewhere that using text_archives with doubles is generally a bad idea since receiving NaN's and inf's is not possible. As the output in the last mail shows, this was not the case. Maybe you have an idea what is causing the trouble with the binary_archive, because using these would seem a lot more safer to me.
Can you just attach a tarball or a single source file instead of code fragments?

Matthias

Hi,

I don't want to make my code public at this moment. It looks hard to find a minimal example, since the error happens only now and then. I'll try to stick to the C bindings and skip the serialization stuff.

Thanks for the help.

Regards,
Martin

Matthias Troyer wrote:
On 21 Sep 2010, at 10:59, Martin Huenniger wrote:
Hi,
so here are the pieces of code that cause the problem:
The following routine is the main loop of the master process. It dispatches the data to and from the worker processes:
'''
At the very same time the worker processes run the next routine. They receive a ball from the master and explore it. During exploration new found balls are sent back to the master.
...
Communication from master to worker is implemented in the next snippet. That is the code that generates the output provided in the last mail.
The idea is the following: Since the boost::mpi::communicator::send() and boost::mpi::communicator::recv() routines refuse to work with my serialization, I implemented the communication using the C Bindings of MPI. So the data gets serialized in some text_[io]archive over a std::stringstream (binary_[io]archive isn't working), and the string extracted from this stringstream is sent over the communication channel. On receiving it gets corrupted sometimes. ...
The template-parameters Jobtype and Restype are instatiated to Ball<double>.
I read elsewhere that using text_archives with doubles is generally a bad idea since receiving NaN's and inf's is not possible. As the output in the last mail shows, this was not the case. Maybe you have an idea what is causing the trouble with the binary_archive, because using these would seem a lot more safer to me.
Can you just attach a tarball or single source file instead of code fragments
Matthias

On 21 Sep 2010, at 12:35, Martin Huenniger wrote:
Hi,
I don't want to make my code public at this moment. It looks hard to find a minimal example, since the error happens only now and then. I'll try to stick to the C Bindings and skip the serialization stuff.
Thanks for the help.
Regards, Martin
I'm sorry, but I cannot help you if you don't send me a code example that exhibits the problem.

Matthias

Hi,

the problem is solved: the bug originated from two issues.

1) int size = ss.str().size();

It is not wise to forget to send the terminating \0 of a C-string. So there are two solutions:

int size = ss.str().size() + 1;

or

char *buf = const_cast<char*>( ss.str().data() );

The first is to be preferred, because:

2) This fragment is _bad_:

MPI_Send( &size, 1, MPI_INT, w, MW::send_job_data_size, MPI_Comm(comm) );
char *buf = const_cast<char*>( ss.str().c_str() );
MPI_Send( buf, size, MPI_CHAR, w, MW::send_job_data_tag, MPI_Comm(comm) );

Why? char *buf gets initialized with the address of a temporary copy of the C-string corresponding to the stringstream ss's string. So when MPI_Send is invoked, the pointer buf points to some memory that is not guaranteed to hold the expected content. So the solution is:

std::stringstream ss;
{
    Oarchive oa(ss);
    oa << data;
}
int size = ss.str().size() + 1;
MPI_Send( &size, 1, MPI_INT, w, MW::send_job_data_size, MPI_Comm(comm) );
MPI_Send( const_cast<char*>( ss.str().c_str() ), size, MPI_CHAR, w, MW::send_job_data_tag, MPI_Comm(comm) );

The next problem is the receiving of binary archives. Its solution is also a bit under the hood:

int size;
MPI_Status mstatus;
MPI_Recv( &size, 1, MPI_INT, 0, MW::send_job_data_size, MPI_Comm(comm), &mstatus );
char *buf = static_cast<char*>( malloc( size ) );
MPI_Recv( buf, size, MPI_CHAR, 0, MW::send_job_data_tag, MPI_Comm(comm), &mstatus );
std::string s( buf );
std::stringstream ss( s );

The problem here is the following: we receive size C characters and try to generate a C++ string from them using std::string s( buf ). Here lies the error: string::string( char * ) expects a C-string, that is, a \0-terminated sequence of chars. If buf holds a binary_archive, the probability of having a \0 in some place is very high, and therefore s only holds a part of the transmitted information. Better use

std::string s( buf, size );

to initialize a string of length size with the data in buf. And that fixes the issue.

Thanks to our Ph.D. student Jens for help with that.

Cheers,
Martin

On 21 Sep 2010, at 16:01, Martin Huenniger wrote:
Hi,
the problem is solved:
the bug originated from two issues: 1) ...
2.) This fragment is _bad_:
...
The next problem is the receiving of binary_archives: Its solution is also a bit under the hood
....
Why don't you use the packed MPI archives, which should avoid all those issues?

Matthias
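For readers who, like Martin, have not come across them: boost::mpi::packed_oarchive and packed_iarchive serialize directly into an MPI_Pack'ed buffer, so there is no text conversion and no manual '\0' handling. A rough sketch of using them by hand (the tags and helper names are placeholders, and the (comm, buffer) constructors are assumed as in recent Boost.MPI releases):

#include <boost/mpi/packed_oarchive.hpp>
#include <boost/mpi/packed_iarchive.hpp>
#include <mpi.h>

// Sender: pack 'data' into a buffer and ship it as MPI_PACKED.
template <class T>
void send_packed(const T &data, int dest, MPI_Comm comm)
{
    boost::mpi::packed_oarchive::buffer_type buffer;
    boost::mpi::packed_oarchive oa(comm, buffer);
    oa << data;
    int size = static_cast<int>(buffer.size());
    MPI_Send(&size, 1, MPI_INT, dest, 0, comm);
    MPI_Send(&buffer[0], size, MPI_PACKED, dest, 1, comm);
}

// Receiver: fetch the buffer and unpack it into 'data'.
template <class T>
void recv_packed(T &data, int source, MPI_Comm comm)
{
    int size = 0;
    MPI_Recv(&size, 1, MPI_INT, source, 0, comm, MPI_STATUS_IGNORE);
    boost::mpi::packed_iarchive::buffer_type buffer(size);
    MPI_Recv(&buffer[0], size, MPI_PACKED, source, 1, comm, MPI_STATUS_IGNORE);
    boost::mpi::packed_iarchive ia(comm, buffer);
    ia >> data;
}

In fact, boost::mpi::communicator::send() and recv() on a serializable type do essentially this internally, so the simplest route is usually to let the communicator handle it once the underlying serialization problem is fixed.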

Hi,

because I have never heard of them before. I just used the information provided by the tutorials for the Boost.MPI and boost/serialization libraries. Maybe I'll try them.

BTW, somehow the standard communication routines don't work. I tried it again and I got the following error:

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::mpi::exception> >'
  what():  MPI_Recv: MPI_ERR_TRUNCATE: message truncated
[ipc858:16349] *** Process received signal ***
[ipc858:16349] Signal: Aborted (6)
[ipc858:16349] Signal code: (-6)
[ipc858:16349] [ 0] /lib/libpthread.so.0 [0x7fb13d04ea80]
[ipc858:16349] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7fb13cd1eed5]
[ipc858:16349] [ 2] /lib/libc.so.6(abort+0x183) [0x7fb13cd203f3]
[ipc858:16349] [ 3] /usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x114) [0x7fb13d7be294]
[ipc858:16349] [ 4] /usr/lib/libstdc++.so.6 [0x7fb13d7bc696]
[ipc858:16349] [ 5] /usr/lib/libstdc++.so.6 [0x7fb13d7bc6c3]
[ipc858:16349] [ 6] /usr/lib/libstdc++.so.6 [0x7fb13d7bc7aa]
[ipc858:16349] [ 7] ./my_complex(_ZN5boost15throw_exceptionINS_3mpi9exceptionEEEvRKT_+0x1ef) [0x42f98f]
[ipc858:16349] [ 8] /home/pirx/local/lib/libboost_mpi.so.1.44.0(_ZNK5boost3mpi12communicator4recvEii+0x80) [0x7fb13ec2aba0]
[ipc858:16349] [ 9] ./my_complex(_ZN2FC6WorkerINS_4BallIdEES2_E8get_workERS2_+0x76) [0x434316]
[ipc858:16349] [10] ./my_complex(_ZN2FC12My_complexIdE13working_horseERSo+0xbf) [0x43b5af]
[ipc858:16349] [11] ./my_complex(_ZN2FC12My_complexIdE7computeERSo+0x1d3) [0x43b803]
[ipc858:16349] [12] ./my_complex(main+0x7e6) [0x427cd6]
[ipc858:16349] [13] /lib/libc.so.6(__libc_start_main+0xe6) [0x7fb13cd0b1a6]
[ipc858:16349] [14] ./my_complex(__gxx_personality_v0+0x1b9) [0x426fe9]
[ipc858:16349] *** End of error message ***

Since my workaround exhibits no errors, I assume that there is some problem within the communication routines of Boost.MPI.

Cheers,
Martin

Matthias Troyer wrote:
On 21 Sep 2010, at 16:01, Martin Huenniger wrote:
Hi,
the problem is solved:
the bug originated from two issues: 1) ...
2.) This fragment is _bad_:
...
The next problem is the receiving of binary_archives: Its solution is also a bit under the hood
....
Why don't you use the packed MPI archives that should avoid all those issues?
Matthias

Hi,

again, it is impossible to find a bug if you are unwilling to send code that shows the bug.

Matthias

Sent from my iPad

On Sep 22, 2010, at 12:31, Martin Huenniger <m.huenniger@uni-jena.de> wrote:
Hi,
because I have never heard of them before. I just used the information provided by the tutorials for the Boost.MPI and boost/serialization libraries.
Maybe I'll try them.
BTW. Somehow the standard communication routines don't work. I tried it again and I got the following error:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::mpi::exception> >' what(): MPI_Recv: MPI_ERR_TRUNCATE: message truncated [ipc858:16349] *** Process received signal *** [ipc858:16349] Signal: Aborted (6) [ipc858:16349] Signal code: (-6) [ipc858:16349] [ 0] /lib/libpthread.so.0 [0x7fb13d04ea80] [ipc858:16349] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7fb13cd1eed5] [ipc858:16349] [ 2] /lib/libc.so.6(abort+0x183) [0x7fb13cd203f3] [ipc858:16349] [ 3] /usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x114) [0x7fb13d7be294] [ipc858:16349] [ 4] /usr/lib/libstdc++.so.6 [0x7fb13d7bc696] [ipc858:16349] [ 5] /usr/lib/libstdc++.so.6 [0x7fb13d7bc6c3] [ipc858:16349] [ 6] /usr/lib/libstdc++.so.6 [0x7fb13d7bc7aa] [ipc858:16349] [ 7] ./my_complex(_ZN5boost15throw_exceptionINS_3mpi9exceptionEEEvRKT_+0x1ef) [0x42f98f] [ipc858:16349] [ 8] /home/pirx/local/lib/libboost_mpi.so.1.44.0(_ZNK5boost3mpi12communicator4recvEii+0x80) [0x7fb13ec2aba0] [ipc858:16349] [ 9] ./my_complex(_ZN2FC6WorkerINS_4BallIdEES2_E8get_workERS2_+0x76) [0x434316] [ipc858:16349] [10] ./my_complex(_ZN2FC12My_complexIdE13working_horseERSo+0xbf) [0x43b5af] [ipc858:16349] [11] ./my_complex(_ZN2FC12My_complexIdE7computeERSo+0x1d3) [0x43b803] [ipc858:16349] [12] ./my_complex(main+0x7e6) [0x427cd6] [ipc858:16349] [13] /lib/libc.so.6(__libc_start_main+0xe6) [0x7fb13cd0b1a6] [ipc858:16349] [14] ./my_complex(__gxx_personality_v0+0x1b9) [0x426fe9] [ipc858:16349] *** End of error message ***
Since my workaround ehibits no errors, I assume that there is someproblem within the communication routines of Boost.MPI.
Cheers, Martin
Matthias Troyer wrote:
On 21 Sep 2010, at 16:01, Martin Huenniger wrote:
Hi,
the problem is solved:
the bug originated from two issues: 1) ...
2.) This fragment is _bad_:
...
The next problem is the receiving of binary_archives: Its solution is also a bit under the hood
.... Why don't you use the packed MPI archives that should avoid all those issues? Matthias

On 22 Sep 2010, at 12:31, Martin Huenniger wrote:
Hi,
because I have never heard of them before. I just used the information provided by the tutorials for the Boost.MPI and boost/serialization libraries.
Maybe I'll try them.
BTW. Somehow the standard communication routines don't work. I tried it again and I got the following error:
Do you use shared pointers in your code? If so, please try the current trunk version.

Matthias

Hi Francesco!

Binary archives are for use on one single platform only. If you want to move archives between different platforms, you have to use something portable - like xml or text archives. I guess x86_64 and ppc64 have different endianness, and your compilers might have different type sizes for int as well.

You can also have a look at my portable binary archive, which you can find at the boost vault. Let me know if you do and find that it works in your case.

Greetings,

--
Christian Pfligersdorffer
Software Engineering
http://www.eos.info

boost-users-bounces@lists.boost.org wrote:
Hello,
I'm getting a segfault when using Boost.MPI on a cluster of heterogeneous machines (x86_64 and ppc64). The problem arises when the "slave" machine, ppc64, receives its payload from the "master" machine, x86_64, and tries to unpack the archive. Tracing down the issue with valgrind and in debug mode, the problem arises here:
==28632== Invalid write of size 8
==28632==    at 0x10429DDC: boost::archive::detail::basic_iarchive_impl::load_pointer(boost::archive::detail::basic_iarchive&, void*&, boost::archive::detail::basic_pointer_iserializer const*, boost::archive::detail::basic_pointer_iserializer const* (*)(boost::serialization::extended_type_info const&)) (basic_iarchive.cpp:453)
==28632==    by 0x1042772F: boost::archive::detail::basic_iarchive::load_pointer(void*&, boost::archive::detail::basic_pointer_iserializer const*, boost::archive::detail::basic_pointer_iserializer const* (*)(boost::serialization::extended_type_info const&)) (basic_iarchive.cpp:564)
==28632==    by 0x10468707: void boost::archive::detail::load_pointer_type<boost::mpi::packed_iarchive>::invoke<pagmo::population*>(boost::mpi::packed_iarchive&, pagmo::population*&) (iserializer.hpp:518)
==28632==    by 0x104683EF: void boost::archive::load<boost::mpi::packed_iarchive, pagmo::population*>(boost::mpi::packed_iarchive&, pagmo::population*&) (iserializer.hpp:586)
==28632==    by 0x10468223: void boost::archive::detail::common_iarchive<boost::mpi::packed_iarchive>::load_override<pagmo::population*>(pagmo::population*&, int) (common_iarchive.hpp:68)
==28632==    by 0x10468023: void boost::archive::basic_binary_iarchive<boost::mpi::packed_iarchive>::load_override<pagmo::population*>(pagmo::population*&, int) (basic_binary_iarchive.hpp:67)
==28632==    by 0x10467E27: void boost::mpi::packed_iarchive::load_override<pagmo::population*>(pagmo::population*&, int, mpl_::bool_<false>) (packed_iarchive.hpp:98)
==28632==    by 0x10467C27: void boost::mpi::packed_iarchive::load_override<pagmo::population*>(pagmo::population*&, int) (packed_iarchive.hpp:115)
==28632==    by 0x1046798F: boost::mpi::packed_iarchive& boost::archive::detail::interface_iarchive<boost::mpi::packed_iarchive>::operator>><pagmo::population*>(pagmo::population*&) (interface_iarchive.hpp:60)
==28632==    by 0x104676BB: void boost::serialization::nvp<pagmo::population*>::load<boost::mpi::packed_iarchive>(boost::mpi::packed_iarchive&, unsigned int) (nvp.hpp:87)
==28632==    by 0x104674AF: void boost::serialization::access::member_load<boost::mpi::packed_iarchive, boost::serialization::nvp<pagmo::population*> >(boost::mpi::packed_iarchive&, boost::serialization::nvp<pagmo::population*>&, unsigned int) (access.hpp:101)
==28632==    by 0x104672CF: boost::serialization::detail::member_loader<boost::mpi::packed_iarchive, boost::serialization::nvp<pagmo::population*> >::invoke(boost::mpi::packed_iarchive&, boost::serialization::nvp<pagmo::population*>&, unsigned int) (split_member.hpp:54)
==28632==  Address 0x4b65d98 is not stack'd, malloc'd or (recently) free'd
The issue is in the method basic_iarchive_impl::load_pointer, around line 450:
int i = cid; cobject_id_vector[i].bpis_ptr = bpis_ptr;
Indeed, a printf confirms that i == 512 while cobject_id_vector.size() == 3. This also provokes the assertion new_cid == cid to fail one line below (where new_cid == 2). The same code, run locally on the ppc64 acting both as slave and master with mpirun -np 2, runs ok. Boost version is 1.42.0, MPI implementation is openMPI 1.4.2.
Can this be related to some endianness issue? Is Boost.MPI expected to work on heterogeneous clusters?
Thanks,
Francesco.

On Sep 2, 2010, at 3:12 AM, Pfligersdorffer, Christian wrote:
Hi Francesco!
Binary archives are for use on one single platform only. If you want to move archives between different platforms, you have to use something portable - like xml or text archives.
Or like MPI, which handles this stuff internally.

--
David Abrahams
BoostPro Computing
http://boostpro.com
participants (7)
- David Abrahams
- Francesco Biscani
- Martin Huenniger
- Martin Hünniger
- Matthias Troyer
- Pfligersdorffer, Christian
- Sebastian Schaetz