
Subject: [Boost-mpi] deadlock (kind of) in nonblocking_test with wait_any on Intel MPI
From: Alain Miniussi (alain.miniussi_at_[hidden])
Date: 2014-09-08 12:14:55


Hi,

I have an issue with this test: it goes into an infinite loop.
I am using Intel MPI 4.1.3 on a Linux box.

I ran nonblocking_test.cpp (the regression test from the Boost
distribution, develop branch) on a single process, and stepping through
with the debugger I see the following issue:

line 85:
  // Send a one-element list. This generates 2 MPI_Isend calls: one for
  // the archive's size, one for the archive itself.
  // Now, with Intel's MPI, it seems that the second one won't be sent
  // until the first one is received.
  // Good or not, I suspect this behavior is legal.
  S: reqs.push_back(comm.isend((comm.rank() + 1) % comm.size(), 0,
                               values[0]));
  // Receive a one-element list. This only generates ONE MPI_Irecv, for
  // the size. A handle is set on the request object to retrieve the
  // second message. The second MPI_Request is set to MPI_REQUEST_NULL.
  R: reqs.push_back(comm.irecv((comm.rank() + comm.size() - 1) %
                               comm.size(),
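
For reference, here is a minimal, self-contained sketch of the pattern
under discussion (assumed setup, not the exact regression test; the
variable names and the serialized payload type are illustrative):

  #include <boost/mpi.hpp>
  #include <boost/serialization/list.hpp>
  #include <boost/serialization/string.hpp>
  #include <list>
  #include <string>
  #include <vector>

  int main(int argc, char* argv[])
  {
    boost::mpi::environment env(argc, argv);
    boost::mpi::communicator comm;

    // A serialized type: sending it is split into two MPI_Isend calls
    // (archive size first, then the archive payload).
    std::list<std::string> outgoing(1, "hello");
    std::list<std::string> incoming;

    std::vector<boost::mpi::request> reqs;
    // S: two underlying MPI_Isend requests.
    reqs.push_back(comm.isend((comm.rank() + 1) % comm.size(), 0,
                              outgoing));
    // R: one underlying MPI_Irecv (for the size) plus a handle that
    // should post the second receive once the size is known.
    reqs.push_back(comm.irecv((comm.rank() + comm.size() - 1) %
                              comm.size(), 0, incoming));

    // With Intel MPI 4.1.3 on a single process, this never returns.
    if (boost::mpi::wait_any(reqs.begin(), reqs.end()).second
        == reqs.begin())
      reqs[1].wait();
    else
      reqs[0].wait();
    return 0;
  }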

Later on:
    // reqs[0] contains 2 MPI_Request objects, only one of which is complete
    // reqs[1] contains only one MPI_Request object, and a handle
    if (wait_any(reqs.begin(), reqs.end()).second == reqs.begin())
      reqs[1].wait();
    else
      reqs[0].wait();

Let's look at wait_any:
       if (current->m_requests[0] != MPI_REQUEST_NULL &&
           current->m_requests[1] != MPI_REQUEST_NULL)
         if (optional<status> result = current->test())
           return std::make_pair(*result, current);
// A: Only the first (send) request will call test(), since
// current->m_requests[1] == MPI_REQUEST_NULL for the recv request.
// B: For the first one, current->test() will basically test both
// requests together (so, in the busy loop, effectively MPI_Waitall),
// which will never succeed since the second MPI_Request seems to wait
// for the first one to be consumed, and that won't happen, given A.

But we have non-trivial requests, so:
       // There are some nontrivial requests, so we must continue our
       // busy waiting loop.
       n = 0;
       current = first;
And so we get a deadlock.

Does anyone have an idea of how to fix this?
I suspect the:
   if (current->m_requests[0] != MPI_REQUEST_NULL &&
       current->m_requests[1] != MPI_REQUEST_NULL)
     if (optional<status> result = current->test())
       return std::make_pair(*result, current);

should be more subtle and allow current->m_requests[1] ==
MPI_REQUEST_NULL when current->m_handle is not null....
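
Something along these lines, perhaps (a sketch only; I'm calling the
member m_handle as above, though the actual name in the Boost.MPI
source may differ, and I haven't checked what else in the loop relies
on the current condition):

  // Also poll requests that carry a handle for a pending second
  // message, even though their second MPI_Request slot is still
  // MPI_REQUEST_NULL (the serialized-receive case from above).
  bool both_active =
      current->m_requests[0] != MPI_REQUEST_NULL &&
      current->m_requests[1] != MPI_REQUEST_NULL;
  bool pending_handle =
      current->m_handle != 0 &&
      current->m_requests[1] == MPI_REQUEST_NULL;
  if (both_active || pending_handle)
    if (optional<status> result = current->test())
      return std::make_pair(*result, current);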

Regards

Alain

