Subject: Re: [Boost-mpi] Is there a way to make a master process to ignore terminations of slave processes?
From: Hal Finkel (hfinkel_at_[hidden])
Date: 2012-12-12 21:00:57
----- Original Message -----
> From: "Izhar Wallach" <izhar.wallach_at_[hidden]>
> To: boost-mpi_at_[hidden]
> Sent: Wednesday, December 12, 2012 12:42:33 PM
> Subject: [Boost-mpi] Is there a way to make a master process to ignore terminations of slave processes?
> I have an MPI program that creates multiple slaves which send data to
> a master process. The master just holds an array of requests and
> whenever a slave is ready, it receives the data the slave has sent
> and sends the slave another request. The jobs of the slaves
> are independent of each other and of the master, so that if a slave
> dies or, for any other reason, stops sending data to the master, the
> execution of the program should not be affected. This should be
> useful when running a long job on a large number of nodes where a
> single failure is more likely to occur.
> I was wondering if there is a way to configure MPI to ignore
> process failures.
I believe that within the context of standard MPI the answer is no. There had been talk of including fault-tolerance support in the new MPI 3 standard, but it does not look like that happened. MPI 3 section 8.3 says, "After an error is detected, the state of MPI is undefined. That is, using a user-defined error handler, or MPI_ERRORS_RETURN, does not necessarily allow the user to continue to use MPI after an error is detected."
There are some related projects at which you might look:
But there is no Boost.MPI support for (non-standard) fault tolerance mechanisms at this time.
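As an aside: whatever mechanism ends up detecting a dead slave, the master still has to remember which request each slave was holding if it wants to hand that request out again. Below is a minimal, MPI-free sketch of that bookkeeping. The names (RequestTracker, assign, complete, mark_failed) and the integer request ids are hypothetical and chosen only for illustration; this is not a Boost.MPI or MPI facility, it does not detect failures itself, and it does nothing about the undefined MPI state described above.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <utility>

// Hypothetical helper (not part of MPI or Boost.MPI): remembers which
// request each slave rank currently holds so it can be handed out again.
class RequestTracker {
public:
    explicit RequestTracker(std::deque<int> pending)
        : pending_(std::move(pending)), completed_(0) {}

    // Give the next pending request to `slave`. Returns false and leaves
    // `request` untouched when nothing is left to hand out.
    bool assign(int slave, int& request) {
        assert(in_flight_.count(slave) == 0);  // assumes the slave is idle
        if (pending_.empty()) return false;
        request = pending_.front();
        pending_.pop_front();
        in_flight_[slave] = request;
        return true;
    }

    // The slave sent back its data; its request is finished.
    void complete(int slave) {
        if (in_flight_.erase(slave) > 0) ++completed_;
    }

    // The slave is presumed dead (how that is decided is outside this
    // sketch); put its request back at the front of the queue.
    void mark_failed(int slave) {
        std::map<int, int>::iterator it = in_flight_.find(slave);
        if (it == in_flight_.end()) return;
        pending_.push_front(it->second);
        in_flight_.erase(it);
    }

    bool done() const { return pending_.empty() && in_flight_.empty(); }
    std::size_t completed() const { return completed_; }

private:
    std::deque<int> pending_;       // requests not yet handed out
    std::map<int, int> in_flight_;  // slave rank -> request it is working on
    std::size_t completed_;         // number of finished requests
};
```

In a master loop, assign would be called when a slave reports ready, complete when its data arrives, and mark_failed when the master decides (for example via an application-level timeout, which is an assumption here) that the slave is gone.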
> Right now, if I manually kill one of the slave
> processes all the other processes terminate as well. In other words,
> if I have 2 slaves and a master, and one of the slave processes
> dies, I would like the remaining master and slave processes to keep
> running.
> I'm using Boost-MPI version 1.52 and mpich2 version 1.5.
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory