
Boost-MPI :

Subject: [Boost-mpi] crash in communicator::iprobe when running distributed program with multiple threads
From: nikhil deveratha hegde (ndhegde_at_[hidden])
Date: 2014-05-28 17:37:08


Hi All,

I am trying to run a Parallel BGL program on multiple nodes, where each node runs one sender thread and one receiver thread. I am using the boost::thread library with Parallel BGL 0.7.0 and Boost 1.54.0.

The sender thread reads a message buffer, does some computation, and calls send_oob.
The receiver thread polls for messages, calls receive_oob, and populates the message buffer.
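
For illustration, a minimal sketch of this thread layout follows. It is not the actual program: it uses std::thread rather than boost::thread, and the MessageBuffer class, the run_sender_receiver function, and the doubling "computation" are all hypothetical stand-ins. The real send_oob and receive_oob calls are replaced by plain vector and queue operations, so the sketch makes no MPI calls and does not reproduce the crash.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for the shared message buffer: a mutex-protected queue.
class MessageBuffer {
public:
    // Called by the receiver thread to store an incoming message.
    void push(int msg) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(msg);
        }
        ready_.notify_one();
    }

    // Called by the sender thread; blocks until a message is available.
    int pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        int msg = queue_.front();
        queue_.pop();
        return msg;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<int> queue_;
};

// Runs one receiver and one sender thread over a simulated stream of
// incoming messages and returns what the sender would have sent.
std::vector<int> run_sender_receiver(const std::vector<int>& incoming) {
    MessageBuffer buffer;
    std::vector<int> sent;  // written only by the sender thread

    // Receiver: stands in for "poll, call receive_oob, populate the buffer".
    std::thread receiver([&] {
        for (int msg : incoming) {
            buffer.push(msg);
        }
    });

    // Sender: stands in for "read the buffer, compute, call send_oob".
    std::thread sender([&] {
        for (std::size_t i = 0; i < incoming.size(); ++i) {
            int msg = buffer.pop();
            sent.push_back(msg * 2);  // placeholder computation
        }
    });

    receiver.join();
    sender.join();
    assert(sent.size() == incoming.size());
    return sent;
}
```

Calling run_sender_receiver with the messages 1, 2, 3 returns 2, 4, 6, since a single producer and a single consumer preserve FIFO order. In the real program both threads additionally call into the process group, and through it into MPI, which this sketch leaves out.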

The program crashes, and the call stack consistently looks like this:

Call stack:
======================================================
#0  mca_pml_ob1_recv_req_start (req=0x7fff89de5e50) at pml_ob1_recvreq.c:967
#1  0x00007f388a51288d in mca_pml_ob1_iprobe (src=-1, tag=-1, comm=0x10e1bb0, matched=0x7fff89de617c, status=0x7fff89de6150) at pml_ob1_iprobe.c:38
#2  0x00007f388d0c81f7 in PMPI_Iprobe (source=-1, tag=-1, comm=0x10e1bb0, flag=<value optimized out>, status=<value optimized out>) at piprobe.c:79
#3  0x00007f388da72392 in boost::mpi::communicator::iprobe(int, int) const () from /home/min/a/hegden/ECE573/Installed/boost_1_54_0/stage/lib/libboost_mpi.so.1.54.0
#4  0x00007f388d8366d4 in boost::graph::distributed::mpi_process_group::poll(bool, int, bool) const () from /home/min/a/hegden/ECE573/Installed/boost_1_54_0/stage/lib/libboost_graph_parallel.so.1.54.0
#5  0x000000000050679c in GAL::GAL_Traverse (this=0xf58760, vis=0x7fff89de6440) at GAL.cpp:1111
#6  0x00000000004eb84f in CorrelationVisitor::VisitDFS (this=0x7fff89de6440, g=0xf58760, pid=0) at CorrelationVisitor.cpp:323
#7  0x00000000004fff6b in main (argc=2, argv=0x7fff89de6758) at TestGL.cpp:103
======================================================

Additional crash logs:

*** glibc detected *** KdTree: free(): corrupted unsorted chunks: 0x00000000018e4b70 ***
[ganymede:21773] *** Process received signal ***
[ganymede:21773] Signal: Segmentation fault (11)
[ganymede:21773] Signal code: Address not mapped (1)
[ganymede:21773] Failing at address: 0x10
======= Backtrace: =========
/lib64/libc.so.6[0x3490876166]
/lib64/libc.so.6[0x3490878ca3]
/var/scratch/openmpi/lib/openmpi/mca_coll_sync.so(+0x1429)[0x7f8e6c3ba429]
/var/scratch/openmpi/lib/libmpi.so.1(mca_coll_base_comm_unselect+0x399)[0x7f8e70441b49]
/var/scratch/openmpi/lib/libmpi.so.1(+0x50289)[0x7f8e70402289]
/var/scratch/openmpi/lib/openmpi/mca_pml_ob1.so(+0xcdf9)[0x7f8e6d87bdf9]
/var/scratch/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x75e)[0x7f8e6d875bee]
/var/scratch/openmpi/lib/libmpi.so.1(PMPI_Bsend+0xf2)[0x7f8e7041f622]
[ganymede:21773] [ 0] /lib64/libpthread.so.0() [0x349100f710]
[ganymede:21773] [ 1] /var/scratch/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_req_start+0x94) [0x7f8e6d87aba4]
[ganymede:21773] [ 2] /var/scratch/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_iprobe+0x12d) [0x7f8e6d87488d]
[ganymede:21773] [ 3] /var/scratch/openmpi/lib/libmpi.so.1(PMPI_Iprobe+0xc7) [0x7f8e7042a1f7]
[ganymede:21773] [ 4] /home/min/a/hegden/ECE573/Installed/boost_1_54_0/stage/lib/libboost_mpi.so.1.54.0(_ZNK5boost3mpi12communicator6iprobeEii+0x42) [0x7f8e70dd4392]
[ganymede:21773] [ 5] /home/min/a/hegden/ECE573/Installed/boost_1_54_0/stage/lib/libboost_graph_parallel.so.1.54.0(_ZNK5boost5graph11distributed17mpi_process_group4pollEbib+0x224) [0x7f8e70b986d4]
[ganymede:21773] [ 6] KdTree(_ZN3GAL12GAL_TraverseEP10GALVisitor+0x166) [0x50679c]
[ganymede:21773] [ 7] KdTree(_ZN18CorrelationVisitor8VisitDFSEP3GALi+0x4f) [0x4eb84f]
[ganymede:21773] [ 8] KdTree(main+0x388) [0x4fff6b]
[ganymede:21773] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x349081ed1d]
[ganymede:21773] [10] KdTree() [0x4eaab9]
[ganymede:21773] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 21773 on node ganymede and exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Thanks,
Nikhil


