Subject: Re: [Boost-users] [EXTERNAL] bjam hangs on select (in develop branch)
From: Belcourt, Kenneth (kbelco_at_[hidden])
Date: 2014-10-20 18:11:21


On Oct 20, 2014, at 4:02 PM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:

> Hi Noel,
>
> No, no -j option.

Interesting. I’ve usually seen bjam miss the subprocess termination signal when -j is around 64 or more. I’ve got an Intel MPI setup that I can use to try to reproduce that zombie child.

This is code I added quite a few years ago so I’ll have to dust off my bjam hat and track this down. Sorry about the hassle, it might take me a few days before I can debug this.

— Noel
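
For readers following the thread, here is a minimal Python sketch of the general failure mode and one common way around it. It is not bjam's code (the real exec_wait is C), and the names run_and_collect and poll_interval are made up for illustration. The idea: a select with no timeout on a child's output only returns once data arrives or every holder of the pipe's write end closes it, so a missed exit can hang the caller. Giving select a timeout and checking for child exit with a non-blocking wait between rounds avoids depending on EOF alone.

```python
import os
import select
import subprocess
import sys


def run_and_collect(cmd, poll_interval=0.1):
    """Run cmd and return (exit_status, combined stdout/stderr bytes).

    Hypothetical helper for illustration only; POSIX-only, because
    select() on pipes is not available on Windows.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    fd = proc.stdout.fileno()
    chunks = []
    while True:
        # Bounded wait: wake up periodically even if no output arrives.
        readable, _, _ = select.select([fd], [], [], poll_interval)
        if readable:
            data = os.read(fd, 4096)
            if data:
                chunks.append(data)
                continue
            break  # EOF: every holder of the write end has closed it.
        # No output within the timeout. poll() does a non-blocking wait,
        # so an exited child is reaped instead of lingering as a zombie.
        if proc.poll() is not None:
            # Drain what is already buffered, then stop, even if some
            # descendant still holds the pipe open.
            while select.select([fd], [], [], 0)[0]:
                data = os.read(fd, 4096)
                if not data:
                    break
                chunks.append(data)
            break
    proc.stdout.close()
    return proc.wait(), b"".join(chunks)


if __name__ == "__main__":
    status, output = run_and_collect([sys.executable, "-c", "print('hello')"])
    print(status, output)
```

This sketch only detects a child that has exited. A child that is merely stopped (state T in the listings below) would still need an overall deadline or handling of stop notifications, which is left out here.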

> I tried the -p option (since bjam is hanging in a select on the output streams), with no effect.
> I don't know if that's relevant, but it seems that most calls to setpgid (and those on the sh process) set errno to 13 (EACCES, a permission problem).
> The select is waiting (without -p) on the stdout of the 'sh' process (with the redirected stderr).
> If I replace mpiexec.hydra (a binary) with mpirun (a wrapper around that binary), only mpiexec.hydra becomes defunct.
>
>   PID USER    PR NI S %CPU   TIME+  PPID COMMAND
>   769 alainm  20  0 S  0.0 0:02.79   768 bjam
>  1028 alainm  20  0 T  0.0 0:00.00   769 sh
>  1029 alainm  20  0 T  0.0 0:00.00  1028 mpirun
>  1034 alainm  20  0 Z  0.0 0:00.00  1029 mpiexec.hydra <defunct>
>
> Alain
>
> On 20/10/2014 19:10, Belcourt, Kenneth wrote:
>> Hi Alain,
>>
>> I’ve seen this problem before but it appears to affect very few people so I’ve not needed to fix it. Perhaps the time has come to address it.
>>
>> Was bjam passed a -j option? If so, what was it?
>>
>> — Noel
>>
>> On Oct 20, 2014, at 9:33 AM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:
>>
>>> Hi,
>>>
>>> I am trying to test Boost.MPI with Intel's MPI implementation and I am stuck while trying to run simple tests through bjam.
>>> bjam hangs in the select (not pselect?) call of the Unix exec_wait.
>>> As far as processes are concerned:
>>>
>>>   PID USER    PR NI S %CPU   TIME+  PPID COMMAND
>>> .......................
>>> 16882 alainm  20  0 S  0.0 0:01.61  6507 bjam
>>> 16899 alainm  20  0 T  0.0 0:00.00 16882 sh
>>> 16900 alainm  20  0 Z  0.0 0:00.00 16899 mpiexec.hydra <defunct>
>>> .......
>>>
>>> bjam calls a generated shell script (below), which calls mpiexec.hydra, which works perfectly fine outside bjam.
>>> mpiexec.hydra dies, but the shell refuses to let it go.
>>>
>>> The shell script, generated by bjam, is:
>>>
>>> ===============================================
>>> [alainm@gurney engine]$ more /proc/16899/cmdline
>>> /bin/sh
>>> LD_LIBRARY_PATH="/gpfs/scratch/alainm/view/boost/bin.v2/libs/mpi/build/intel-linux/debug:/gpfs/scratch/alainm/view/boost/bin.v2/libs/serialization/build/intel-linux/debug:/softs/intel/composer_xe_2015.0.090/bin/lib:/softs/intel/composer_xe_2015.0.090/lib/intel64:$LD_LIBRARY_PATH"
>>> export LD_LIBRARY_PATH
>>>
>>> status=0
>>> if test $status -ne 0 ; then
>>> echo Skipping test execution due to testing.execute=off
>>> exit 0
>>> fi
>>> mpiexec.hydra -n 2 "../../../bin.v2/libs/mpi/test/broadcast_stl_test-2.test/intel-linux/debug/broadcast_stl_test-2" blob > "../../../bin.v2/libs/mpi/test/broadcast_stl_test-2.test/intel-linux/debug/broadcast_stl_test-2-run.output" 2>&1
>>> status=$?
>>> echo >> "../../../bin.v2/libs/mpi/test/broadcast_stl_test-2.test/intel-linux/debug/broadcast_stl_test-2-run.output"
>>> echo EXIT STATUS: $status >> "../../../bin.v2/libs/mpi/test/broadcast_stl_test-2.test/intel-linux/debug/broadcast_stl_test-2-run.output"
>>> if test $status -eq 0 ; then
>>> cp "../../../bin.v2/libs/mpi/test/broadcast_stl_test-2.test/intel-linux/debug/broadcast_stl_test-2-run.output" "../../../bin.v2/libs/mpi/test/broadcast_stl_test-2.test/intel-linux/debug/broadcast_stl_test-2-run"
>>> fi
>>> verbose=0
>>> if test $status -ne 0 ; then
>>> verbose=1
>>> fi
>>> if test $verbose -eq 1 ; then
>>> echo ====== BEGIN OUTPUT ======
>>> cat "../../../bin.v2/libs/mpi/test/broadcast_stl_test-2.test/intel-linux/debug/broadcast_stl_test-2-run.output"
>>> echo ====== END OUTPUT ======
>>> fi
>>> exit $status
>>>
>>> [alainm@gurney engine]$
>>> =================================================
>>>
>>>
>>> Note that select only tests for the subprocess output; at the hanging point, mpiexec.hydra is done with its output.
>>>
>>> Any ideas?
>>>
>>> Alain
>>>
>>> PS: There was a CMake-based project some time ago. Is it still active, or is bjam here to stay?
>>>
>>> _______________________________________________
>>> Boost-users mailing list
>>> Boost-users_at_[hidden]
>>> http://lists.boost.org/mailman/listinfo.cgi/boost-users
>
>
> --
> ---
> Alain
>

