I have a very simple MPI program:
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>   /* gethostname() */

int main(int argc, char **argv)
{
    int my_rank;
    int size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* a few selected ranks report which host they run on */
    if (my_rank == 0 || my_rank == 18 || my_rank == 36) {
        char hostbuffer[256];
        gethostname(hostbuffer, sizeof(hostbuffer));
        printf("Hostname: %s\n", hostbuffer);
    }

    MPI_Finalize();
    return 0;
}
I am running it on a cluster with two nodes. I build it with a Makefile that invokes mpicc to produce the cannon.run executable, and I run it with the following command (second_machinefile contains the names of the two nodes):
time mpirun --mca btl ^openib -n 64 -hostfile ../second_machinefile ./cannon.run
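For reference, a hostfile of that shape in Open MPI's syntax looks like this (the hostnames match my setup, but the slot counts shown here are placeholders):

machine_1 slots=32
machine_2 slots=32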
The weird problem is that when I run this command from one node it executes normally, but when I run it from the other node I get this error:
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x30
After trying to run it under GDB I got this backtrace:
#0 0x00007ffff646e936 in ?? ()
from /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so
#1 0x00007ffff6449733 in pmix_common_dstor_init ()
from /lib/x86_64-linux-gnu/libmca_common_dstore.so.1
#2 0x00007ffff646e5b4 in ?? ()
from /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so
#3 0x00007ffff659e46e in pmix_gds_base_select ()
from /lib/x86_64-linux-gnu/libpmix.so.2
#4 0x00007ffff655688d in pmix_rte_init ()
from /lib/x86_64-linux-gnu/libpmix.so.2
#5 0x00007ffff6512d7c in PMIx_Init () from /lib/x86_64-linux-gnu/libpmix.so.2
#6 0x00007ffff660afe4 in ext2x_client_init ()
from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so
#7 0x00007ffff72e1656 in ?? ()
from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so
#8 0x00007ffff7a9d11a in orte_init ()
from /lib/x86_64-linux-gnu/libopen-rte.so.40
#9 0x00007ffff7d6de62 in ompi_mpi_init ()
from /lib/x86_64-linux-gnu/libmpi.so.40
#10 0x00007ffff7d9c17e in PMPI_Init () from /lib/x86_64-linux-gnu/libmpi.so.40
#11 0x00005555555551d6 in main ()
To be honest, I don't fully understand this backtrace.
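For completeness: one way to obtain a backtrace like this is to run a single rank under gdb through mpirun. A sketch, assuming the crash also reproduces with one process (the host name here is a placeholder):

mpirun --mca btl ^openib -n 1 --host machine_2 gdb --batch -ex run -ex bt ./cannon.run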
My main confusion is that the program runs properly when launched from machine_1: it connects to machine_2 without errors, and processes are initialized on both machines. But when I launch the same command from machine_2, it is not able to connect to machine_1. The program also runs correctly on machine_2 alone when I decrease the number of processes so that they fit on a single machine.
Is there anything I am doing wrong? Or what could I try in order to better understand the cause of the problem?
The question is tagged mpich, but the traces are clearly Open MPI. Are you sure mpirun and the MPI libraries on all the nodes are from the same vendor & version? I assume you are invoking mpirun on machine_1. In that case, what if you simply mpirun --host machine_2 --mca btl ^openib -np 1 ./cannon.run? As a temporary workaround, you can try export PMIX_MCA_gds=^ds21, and then try your initial mpirun command line again. - Gilles Gouaillardet

gds is a PMIx framework that requires one component. The available ones are ds12, hash, and the default ds21. Since you are facing an issue with the ds21 component, the workaround blacklists it. Note you can make the change system-wide by adding gds = ^ds21 in all your /.../etc/pmix-mca-params.conf files. A better option is to use the latest Open MPI (4.0.2); it can be installed in your $HOME directory and hence does not require root access to your machines. - Gilles Gouaillardet
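Spelling out the suggested temporary workaround: on the node where mpirun is invoked, blacklist the ds21 gds component in the environment and rerun the original command line from the question:

export PMIX_MCA_gds=^ds21
time mpirun --mca btl ^openib -n 64 -hostfile ../second_machinefile ./cannon.run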
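The system-wide variant is a one-line addition to the PMIx MCA parameter file on each node (the path is elided in the comment above; it typically sits under the PMIx installation prefix):

# /.../etc/pmix-mca-params.conf
gds = ^ds21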
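The upgrade route is a standard user-local build, which needs no root access. A sketch, assuming the usual Open MPI download location and an install prefix in $HOME:

wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.2.tar.gz
tar xzf openmpi-4.0.2.tar.gz
cd openmpi-4.0.2
./configure --prefix=$HOME/openmpi-4.0.2
make -j 4 all && make install
# make the new install visible on every node (shared $HOME, or repeat per node):
export PATH=$HOME/openmpi-4.0.2/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi-4.0.2/lib:$LD_LIBRARY_PATH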