2
votes

I'm trying to write a parallel program using MPI in C. However, when I run it, I get the message below and the program is terminated. I do not know what causes this error.

WARNING: Unable to read mpd.hosts or list of hosts isn't provided. MPI job will be run on the current machine only.

Solution is starting

rank 7 in job 1 server_name_60409 caused collective abort of all ranks exit status of rank 7: return code 0

rank 6 in job 1 server_name_60409 caused collective abort of all ranks exit status of rank 6: return code 0

rank 4 in job 1 server_name_60409 caused collective abort of all ranks exit status of rank 4: killed by signal 9

rank 3 in job 1 server_name_60409 caused collective abort of all ranks exit status of rank 3: killed by signal 9

rank 2 in job 1 server_name_60409 caused collective abort of all ranks exit status of rank 2: return code 0

rank 0 in job 1 server_name_60409 caused collective abort of all ranks exit status of rank 0: return code 0

2
Assuming you're running Unix, signal 9 is SIGKILL. It is often triggered by invalid memory accesses (e.g. buggy code that tries to read/write/free memory that it doesn't own). However, without seeing your code, there's not much more we can tell you. - suszterpatt
@suszterpatt, invalid memory access triggers SIGSEGV (signal 11), not SIGKILL. - Hristo Iliev
You might be running out of memory or hitting a CPU time limit as all your processes run on the same node. If any rank dies abnormally (e.g. because of CPU or memory limits being hit), the MPI launcher would kill the remaining ranks by sending them a signal, usually SIGKILL (9). - Hristo Iliev
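
One way to check the limits mentioned above is to print them from inside the program. This is only a sketch, assuming a POSIX system; `print_limit` and `report_limits` are made-up helper names, and it only shows per-process limits (the same ones `ulimit` reports), not limits enforced by a batch scheduler or the kernel's out-of-memory killer.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: print the soft and hard limit of one resource.
   Returns 0 on success, -1 if getrlimit() fails. */
int print_limit(const char *name, int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }

    printf("%-8s soft=", name);
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("unlimited");
    else
        printf("%llu", (unsigned long long)rl.rlim_cur);

    printf(" hard=");
    if (rl.rlim_max == RLIM_INFINITY)
        printf("unlimited\n");
    else
        printf("%llu\n", (unsigned long long)rl.rlim_max);

    return 0;
}

/* Hypothetical helper: report the limits most relevant to a killed rank.
   Returns 0 only if every query succeeded. */
int report_limits(void)
{
    int rc = 0;

    rc |= print_limit("cpu", RLIMIT_CPU);     /* CPU time, in seconds */
    rc |= print_limit("memory", RLIMIT_AS);   /* address space, in bytes */
    rc |= print_limit("stack", RLIMIT_STACK); /* stack size, in bytes */
    return rc;
}
```

Calling `report_limits()` once per rank, right after `MPI_Init`, would show whether all ranks on the node share a tight limit.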

2 Answers

0
votes

If you forget to call MPI_Finalize() before your program exits, you will get the same kind of error:

rank 3 in job 98 n01_44763 caused collective abort of all ranks
exit status of rank 3: return code 0

1
votes

My program was aborting with a similar message:

rank 3 in job 58409  vnode-01_39157   caused collective abort of all ranks
  exit status of rank 3: killed by signal 9 
rank 1 in job 58409  vnode-01_39157   caused collective abort of all ranks
  exit status of rank 1: killed by signal 11 

The cause was allocating too much memory on the stack.
Switching to heap allocation fixed it.
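
A rough sketch of that kind of change, assuming the problem was a large local array. The name `make_buffer` and the size `N_ELEMS` are invented for the example; the real code may have looked different.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical size: 16 Mi doubles = 128 MiB, which is likely to exceed
   the default stack limit on many systems. */
#define N_ELEMS ((size_t)16 * 1024 * 1024)

/* Hypothetical helper: allocate a zero-initialized buffer on the heap.
   Returns NULL (after printing a message) if the allocation fails. */
double *make_buffer(size_t n)
{
    /* Declaring "double buf[N_ELEMS];" as a local variable instead would
       place the whole array on the stack and could crash the process. */
    double *buf = calloc(n, sizeof *buf);

    if (buf == NULL)
        fprintf(stderr, "could not allocate %zu doubles\n", n);
    return buf;
}
```

The caller would use `double *buf = make_buffer(N_ELEMS);`, check for `NULL`, and call `free(buf)` when done. Unlike a stack overflow, a failed heap allocation can be detected and reported before the rank dies.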