
Boost-MPI :

From: Mark Lohry (mlohry_at_[hidden])
Date: 2019-08-31 09:26:29

Hi all, I've been getting sporadic errors on larger MPI runs that I'm
having difficulty tracking down, and I'm hoping someone else has seen
them before.

As far as I can tell, this only comes up at larger core counts (400+), and
not deterministically: otherwise identical jobs behave differently. It
happens with both Intel MPI 2019 and Open MPI 3.1.3.

Shortly after startup, after a successful MPI broadcast, I get many
messages like these in the Slurm stdout logs:

Non-fatal temporary exhaustion of send tid dma descriptors
(elapsed=161.497s, source LID=0x9e/context=33, count=1) (err=0)
Non-fatal temporary exhaustion of send tid dma descriptors
(elapsed=161.500s, source LID=0xae/context=27, count=27) (err=0)

On one run, these messages stopped after an elapsed time of 163 seconds,
and the job resumed and ran normally to completion. On another, identical
run, the messages continued for several hours, forcing me to kill the job.

This *seems* to happen only in the vicinity of a Boost.MPI broadcast at
the start of my program. It uses boost::mpi to broadcast an object
serialized with Boost.Serialization that contains a small amount of mixed
data: maybe 80 short strings and a handful of floating-point values. I
have also tried replacing this broadcast with one-by-one blocking
send/recv calls from the master to each other process, but that doesn't
fix the issue.

I say it *seems* to happen here because I have an explicit barrier right
after this broadcast, and after that barrier the master process writes
information to stdout, indicating that the broadcast has succeeded.

Does this behavior sound familiar to anyone?

Mark Lohry
