Subject: Re: [Boost-mpi] Is there a way to make a master process to ignore terminations of slave processes?
From: Hal Finkel (hfinkel_at_[hidden])
Date: 2012-12-12 21:00:57


----- Original Message -----
> From: "Izhar Wallach" <izhar.wallach_at_[hidden]>
> To: boost-mpi_at_[hidden]
> Sent: Wednesday, December 12, 2012 12:42:33 PM
> Subject: [Boost-mpi] Is there a way to make a master process to ignore terminations of slave processes?
>
>
> Hello,
>
> I have an MPI program that creates multiple slaves which send data to
> a master process. The master just holds an array of requests and,
> whenever a slave is ready, it receives the data the slave has sent
> and sends the slave another request. The jobs of the slaves
> are independent of each other and of the master, so if a slave
> dies or, for any other reason, stops sending data to the master, the
> execution of the program should not be affected. This would be
> useful when running a long job on a large number of nodes, where a
> single failure is more likely to occur.
>
> I was wondering if there is a way to configure MPI to ignore
> process failures.

I believe that within the context of standard MPI the answer is no. There had been talk of including fault-tolerance support in the new MPI 3 standard, but it does not look like that happened. MPI 3 section 8.3 says, "After an error is detected, the state of MPI is undefined. That is, using a user-defined error handler, or MPI_ERRORS_RETURN, does not necessarily allow the user to continue to use MPI after an error is detected."

There are some related projects at which you might look:
http://www.open-mpi.org/faq/?category=ft
http://icl.cs.utk.edu/harness/

But there is no Boost.MPI support for (non-standard) fault tolerance mechanisms at this time.
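
For illustration only, the master-side bookkeeping described in the question (a queue of requests, one outstanding request per slave, and a request going back on the queue when its slave is given up on) might be sketched in plain C++ as below. This is a sketch under assumptions, not MPI or Boost.MPI functionality: the class TaskBook and all of its member names are hypothetical, it makes no MPI calls, and how the master decides that a slave is lost is exactly the part that standard MPI does not provide.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <vector>

// Hypothetical master-side bookkeeping. None of these names come from
// MPI or Boost.MPI; tasks and workers are identified by plain ints.
class TaskBook {
public:
    explicit TaskBook(int num_tasks) {
        for (int t = 0; t < num_tasks; ++t) pending_.push_back(t);
    }

    // Hand the next pending task to `worker`.
    // Returns the task id, or -1 if nothing is pending.
    int assign(int worker) {
        assert(in_flight_.count(worker) == 0 && "worker already has a task");
        if (pending_.empty()) return -1;
        int task = pending_.front();
        pending_.pop_front();
        in_flight_[worker] = task;
        return task;
    }

    // The worker returned its result: record its task as done.
    void complete(int worker) {
        auto it = in_flight_.find(worker);
        if (it == in_flight_.end()) return;  // nothing outstanding for this worker
        done_.push_back(it->second);
        in_flight_.erase(it);
    }

    // The worker is considered lost: put its task back at the front of
    // the queue so that another worker picks it up next.
    void lose(int worker) {
        auto it = in_flight_.find(worker);
        if (it == in_flight_.end()) return;
        pending_.push_front(it->second);
        in_flight_.erase(it);
    }

    bool finished() const { return pending_.empty() && in_flight_.empty(); }
    std::size_t pending() const { return pending_.size(); }
    const std::vector<int>& done() const { return done_; }

private:
    std::deque<int> pending_;       // tasks not yet handed out
    std::map<int, int> in_flight_;  // worker id -> task id currently assigned
    std::vector<int> done_;         // completed task ids, in completion order
};
```

In a real program the master would call assign() when it sends a request, complete() when the matching result arrives, and lose() when it gives up on a slave. The sketch deliberately leaves out the hard part: with standard MPI, a dead slave can leave the whole job in an undefined state, so this bookkeeping alone does not make the program fault tolerant.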

 -Hal

> Right now, if I manually kill one of the slave
> processes all the other processes terminate as well. In other words,
> if I have 2 slaves and a master, and one of the slave processes
> dies, I would like the remaining master and slave processes to keep
> running.
>
> I'm using Boost-MPI version 1.52 and mpich2 version 1.5.
>
> Thanks,
> Izhar

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
