
I have a very simple MPI program:

  #include <stdio.h>
  #include <unistd.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
    int my_rank;
    int size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (my_rank == 0 || my_rank == 18 || my_rank == 36) {
      char hostbuffer[256];
      gethostname(hostbuffer, sizeof(hostbuffer));
      printf("Hostname: %s\n", hostbuffer);
    }
    MPI_Finalize();
    return 0;
  }

I am running it on a cluster with two nodes. I build it with a makefile that invokes mpicc to produce the cannon.run executable, and I run it with the following command:

time mpirun --mca btl ^openib -n 64 -hostfile ../second_machinefile ./cannon.run

In second_machinefile I have the names of these two nodes. The weird problem is that when I run this command from one node it executes normally, but when I run the same command from the other node I get an error:

Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x30

After running it under GDB I got this backtrace:

#0  0x00007ffff646e936 in ?? ()
   from /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so
#1  0x00007ffff6449733 in pmix_common_dstor_init ()
   from /lib/x86_64-linux-gnu/libmca_common_dstore.so.1
#2  0x00007ffff646e5b4 in ?? ()
   from /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so
#3  0x00007ffff659e46e in pmix_gds_base_select ()
   from /lib/x86_64-linux-gnu/libpmix.so.2
#4  0x00007ffff655688d in pmix_rte_init ()
   from /lib/x86_64-linux-gnu/libpmix.so.2
#5  0x00007ffff6512d7c in PMIx_Init () from /lib/x86_64-linux-gnu/libpmix.so.2
#6  0x00007ffff660afe4 in ext2x_client_init ()
   from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so
#7  0x00007ffff72e1656 in ?? ()
   from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so
#8  0x00007ffff7a9d11a in orte_init ()
   from /lib/x86_64-linux-gnu/libopen-rte.so.40
#9  0x00007ffff7d6de62 in ompi_mpi_init ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#10 0x00007ffff7d9c17e in PMPI_Init () from /lib/x86_64-linux-gnu/libmpi.so.40
#11 0x00005555555551d6 in main ()

which to be honest I don't fully understand.

My main confusion is that the program executes properly from machine_1: it connects to machine_2 without errors and processes are initialized on both machines. But when I try to execute the same command from machine_2, it is not able to connect to machine_1. The program also runs correctly on machine_2 alone, if I decrease the number of processes so they fit on one machine.

Is there anything I am doing wrong? Or what could I try in order to better understand the cause of the problem?

Comments:

  • For gdb to give any useful information you would have to compile everything for debug. – prmottajr
  • @prmottajr ah yes, true. I need the -g flag for mpi as well, right? – Ana Khorguani
  • There is an answer at stackoverflow.com/questions/329259/… showing how to debug. Also check if the mpi daemon is running on node 1; that could prevent a connection to the node. – prmottajr
  • The question is tagged with mpich, but the traces are clearly Open MPI. Are you sure mpirun and the MPI libraries on all the nodes are from the same vendor & version? I assume you are invoking mpirun on machine_1. In that case, what if you simply mpirun --host machine_2 --mca btl ^openib -np 1 ./cannon.run? As a temporary workaround, you can try export PMIX_MCA_gds=^ds21, and then try your initial mpirun command line again. – Gilles Gouaillardet
  • gds is a PMIx framework that requires exactly one component. The available ones are ds12, hash, and the default ds21. Since you are facing an issue with the ds21 component, the workaround blacklists it. Note you can make the change system-wide by adding gds = ^ds21 to all your /.../etc/pmix-mca-params.conf files. A better option is to use the latest Open MPI (4.0.2); it can be installed in your $HOME directory and hence does not require root access to your machines. – Gilles Gouaillardet
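As the comments point out, GDB can only resolve frames such as main() into source lines when the binary is built with debug information. A minimal sketch, assuming the source file is named cannon.c (the actual file name is not given in the question):

```shell
# Build with debug symbols and without optimization so GDB
# can map addresses in the backtrace to source lines
mpicc -g -O0 cannon.c -o cannon.run
```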

1 Answer


This is indeed a bug in Open PMIx that is addressed at https://github.com/openpmix/openpmix/pull/1580

Meanwhile, a workaround is to blacklist the gds/ds21 component:

  • One option is to set

export PMIX_MCA_gds=^ds21

in the environment before invoking mpirun.

  • Another option is to add the following line

gds = ^ds21

to the PMIx config file located in <pmix_prefix>/etc/pmix-mca-params.conf.
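Applied to the command line from the question, the first workaround looks like this (the hostfile path and executable name are taken from the question):

```shell
# Blacklist PMIx's gds/ds21 component; mpirun and the processes
# it launches inherit this environment variable
export PMIX_MCA_gds=^ds21

# Then launch exactly as before
time mpirun --mca btl ^openib -n 64 -hostfile ../second_machinefile ./cannon.run
```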