Hello,

I have an MPI program that creates multiple slaves, which send data to a master process. The master holds an array of requests; whenever a slave is ready, the master receives the data that slave sent and sends it another request. The slaves' jobs are independent of each other and of the master, so if a slave dies, or stops sending data to the master for any other reason, the rest of the program should not be affected. This would be useful when running a long job on a large number of nodes, where a single failure is more likely to occur.

I was wondering if there is a way to configure MPI to ignore process failures. Right now, if I manually kill one of the slave processes, all the other processes terminate as well. In other words, if I have two slaves and a master and one of the slave processes dies, I would like the master and the remaining slave to keep running.
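
To make the behaviour I'm after concrete, here is a minimal, MPI-free C++ sketch of the master's dispatch loop. It is not my actual program and contains no MPI calls: the names `run_master`, `Outcome` and `lifetime` are made up for illustration, and a slave's death is simulated by a counter rather than by a killed process. The point is only that the master should keep handing requests to the surviving slaves.

```cpp
#include <cstddef>
#include <vector>

// Result of a simulated run: which requests were answered, and which
// were in flight on a slave that died (hypothetical type).
struct Outcome {
    std::vector<int> answered;
    std::vector<int> lost;
};

// Simulates the master's dispatch loop without any MPI calls.
// requests:    the master's array of pending requests.
// lifetime[i]: how many requests slave i answers before it dies
//              silently (a negative value means it never dies).
Outcome run_master(const std::vector<int>& requests,
                   const std::vector<int>& lifetime) {
    Outcome out;
    std::size_t next = 0;                  // next request to hand out
    const std::size_t n = lifetime.size();
    std::vector<int> in_flight(n, 0);      // request currently held by each slave
    std::vector<bool> busy(n, false);      // does slave i hold a request?
    std::vector<int> answered_by(n, 0);    // answers received from slave i

    // Seed every slave with one request, as long as requests remain.
    for (std::size_t i = 0; i < n && next < requests.size(); ++i) {
        in_flight[i] = requests[next++];
        busy[i] = true;
    }

    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t i = 0; i < n; ++i) {
            if (!busy[i]) continue;
            if (lifetime[i] >= 0 && answered_by[i] >= lifetime[i]) {
                // Slave i died while holding a request. The master never
                // hears back, but the other slaves must keep going.
                out.lost.push_back(in_flight[i]);
                busy[i] = false;
                continue;
            }
            // Slave i reports back; record the answer and, if any
            // requests remain, send it the next one.
            out.answered.push_back(in_flight[i]);
            ++answered_by[i];
            if (next < requests.size()) {
                in_flight[i] = requests[next++];
            } else {
                busy[i] = false;
            }
            progress = true;
        }
    }
    return out;
}
```

For example, calling `run_master({1, 2, 3, 4, 5, 6}, {-1, 1})` simulates two slaves where the second one dies after answering a single request. The first slave keeps working through the remaining requests, and only the request the dead slave was holding is lost. That is the behaviour I would like from the real MPI program, instead of every process terminating.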

I'm using Boost.MPI version 1.52 and MPICH2 version 1.5.

Thanks,
Izhar