I don't have much to add to what High Performance Mark has already written, except the following: you can actually call MPI_FINALIZE and exit the excess processes, but you have to be aware that doing so disrupts all further collective operations on the world communicator MPI_COMM_WORLD; most of them would simply never complete (with MPI_BARRIER being one that would certainly hang). To prevent this, first create a new communicator that excludes all unnecessary processes:
int rank, size;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

// Obtain the group of processes in the world communicator
MPI_Group world_group;
MPI_Comm_group(MPI_COMM_WORLD, &world_group);

// Remove all unnecessary ranks
// (MPI_Group_range_excl takes an array of (first, last, stride) triplets)
MPI_Group new_group;
int ranges[1][3] = {{ process_limit, size-1, 1 }};
MPI_Group_range_excl(world_group, 1, ranges, &new_group);

// Create a new communicator
MPI_Comm newworld;
MPI_Comm_create(MPI_COMM_WORLD, new_group, &newworld);

// The group handles are no longer needed
MPI_Group_free(&new_group);
MPI_Group_free(&world_group);

if (newworld == MPI_COMM_NULL)
{
   // Bye bye cruel world
   MPI_Finalize();
   exit(0);
}
// From now on use newworld instead of MPI_COMM_WORLD
This code first obtains the group of processes in MPI_COMM_WORLD and then creates a new group that excludes all processes from process_limit onwards. From that group it creates a new communicator. MPI_COMM_CREATE returns MPI_COMM_NULL in those processes that are not part of the new group, and this fact is used to terminate them. Since some of the processes have now "disappeared" from MPI_COMM_WORLD, it is no longer usable for collective operations such as broadcasts and barriers, and newworld should be used instead.
Also, as Mark has pointed out, on some architectures the extra processes might actually linger around even after they have returned from main. For example, on Blue Gene, Cray, or any other system that uses hardware partitions to manage MPI jobs, the additional resources are not freed until the whole MPI job has finished. The same applies when the program runs on a cluster or another system under the control of a resource manager (e.g. SGE, LSF, Torque, PBS, SLURM, etc.).
My usual approach to such cases is very pragmatic:
int size, rank;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

if (size != process_limit)
{
   if (rank == 0)
      printf("Please run this program with %d MPI processes\n", process_limit);
   MPI_Finalize();
   exit(1);
}
You could also use MPI_Abort(MPI_COMM_WORLD, 0); instead of MPI_Finalize() to annoy the user :)
You can also use the process-spawning features of MPI, but that would make the code more complex as you would have to deal with intercommunicators.