I am currently developing a program written in C++ with the MPI+pthread paradigm.
I add some functionality to my program, however I have a bad termination message from one MPI process, like this:
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 37805 RUNNING AT node165
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@node162] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
[proxy:0:0@node162] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@node166] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
[proxy:0:2@node166] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@node166] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
srun: error: node162: task 0: Exited with exit code 7
[proxy:0:0@node162] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
srun: error: node166: task 2: Exited with exit code 7
[mpiexec@node162] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@node162] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@node162] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@node162] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
My problem is such that I have no idea about why I have this kind of message, and thus how to correct it.
I use only some basic functions from MPI, and ensure that there is no threads which uses MPI calls (only my "master process" is allowed to call such functions).
I also checked that one process does not send message to itself, and that the process destination exist before sending a message.
My question is quite simple: how to know where the problem comes from to then debug my application ?
Thank you a lot.