Hi all, I've been getting some sporadic errors on larger MPI runs that I'm having difficulty tracking down, and I'm hoping someone else has seen these before.
As far as I can tell, this only comes up at larger core counts (400+), and it appears non-deterministically across otherwise identical jobs. It happens with both Intel MPI 2019 and Open MPI 3.1.3.
Shortly after startup, after a successful MPI broadcast, I get many messages like these in the Slurm stdout logs:
Non-fatal temporary exhaustion of send tid dma descriptors (elapsed=161.497s, source LID=0x9e/context=33, count=1) (err=0)
Non-fatal temporary exhaustion of send tid dma descriptors (elapsed=161.500s, source LID=0xae/context=27, count=27) (err=0)
On one run, these messages stopped after an elapsed time of 163 seconds, and the job resumed happily with everything working for the rest of its runtime. On another, identical run, these messages continued for several hours, forcing me to kill the job.
This *seems* to happen only in the vicinity of a Boost.MPI broadcast at the start of my program. It uses boost::mpi to broadcast a Boost-serialized object containing a small amount of mixed data: maybe 80 short strings and a handful of floating point values. I have also tried replacing this broadcast with one-by-one blocking send/recv calls from the master to each other process, but that doesn't fix the issue.
I say it *seems* to happen here because I do have an explicit barrier after this broadcast, and after that the master process writes information to stdout indicating the broadcast has succeeded.
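For reference, the structure of that startup broadcast and the barrier/check after it is roughly the following sketch (the struct and field names are simplified placeholders, not my actual code):

#include <boost/mpi.hpp>
#include <boost/serialization/string.hpp>
#include <boost/serialization/vector.hpp>
#include <string>
#include <vector>

// Stand-in for the broadcast payload: ~80 short strings plus a few doubles.
struct Config {
    std::vector<std::string> names;
    std::vector<double> values;

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & names;
        ar & values;
    }
};

int main(int argc, char* argv[]) {
    boost::mpi::environment env(argc, argv);
    boost::mpi::communicator world;

    Config cfg;
    if (world.rank() == 0) {
        // master populates cfg here
    }

    // Broadcast the serialized object from rank 0 to all other ranks.
    boost::mpi::broadcast(world, cfg, 0);

    // Explicit barrier after the broadcast.
    world.barrier();

    if (world.rank() == 0) {
        // master writes a success message to stdout here
    }

    // ... rest of the program ...
    return 0;
}

The send/recv workaround I mentioned just replaces the broadcast call with a loop of world.send(rank, tag, cfg) on rank 0 and a matching world.recv(0, tag, cfg) on every other rank; the symptoms are the same either way.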
Does this behavior sound familiar to anyone?
Thanks,
Mark Lohry