I am running my MPI program on an Intel Sandy Bridge cluster, on a 16-node partition. There are two processors per node and 8 cores per processor. I started a run with "mpirun -n 256 ./myprogram". Now I need a representative process on each node to report the power consumed by the two processors of that node (using RAPL). My question is how to select that process. For example, if it is guaranteed that processes are assigned to the nodes as 1-16, 17-32, 33-48, etc., then I can just check the MPI rank of a process and decide whether it should report the power. Can numactl be used to bind a large number of processes across multiple nodes?
2 Answers
If you use an MPI 3.x implementation, you can use MPI_Comm_split_type [1] with MPI_COMM_TYPE_SHARED as the split_type parameter. This splits the communicator (in your case, certainly MPI_COMM_WORLD) into subcommunicators that are exactly the shared-memory regions of your cluster, i.e. the nodes. You then have a local root per subcommunicator, which can be your representative process.
[1] Pages 247-248, MPI: A Message-Passing Interface Standard (Version 3.0 or 3.1)
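For illustration, a minimal sketch of this approach could look as follows (the variable names are my own choice, and error checking is omitted):

#include <mpi.h>
#include <iostream>

int main( int argc, char *argv[] ) {
    MPI_Init( &argc, &argv );

    int worldRank;
    MPI_Comm_rank( MPI_COMM_WORLD, &worldRank );

    // Split MPI_COMM_WORLD into one sub-communicator per shared-memory region (i.e. per node)
    MPI_Comm nodeComm;
    MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodeComm );

    // Rank 0 within each per-node communicator acts as the representative process
    int nodeRank;
    MPI_Comm_rank( nodeComm, &nodeRank );
    if ( nodeRank == 0 ) {
        std::cout << "[" << worldRank << "] representative process for this node\n";
        // ... read the RAPL counters here ...
    }

    MPI_Comm_free( &nodeComm );
    MPI_Finalize();
    return 0;
}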
The question of placing specific ranks on specific nodes isn't covered (AFAIK) by the MPI standard. However, each implementation and/or machine is likely to offer this feature, for example via options to the MPI launcher (mpirun, mpiexec, srun, prun, orterun, [add here your preferred MPI launcher], ...), or via the batch scheduler if applicable. So for this specific information, I encourage you to refer to your MPI library documentation, your batch scheduler documentation, or your machine documentation.
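For instance, with Hydra-based launchers (MPICH, Intel MPI), something like the following would typically place 16 consecutive ranks on each node, but do check the documentation of your own launcher before relying on it:

mpirun -ppn 16 -n 256 ./myprogram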
However, the feature you want is independent of the actual MPI process placement: you can very easily implement it so that it works irrespective of how processes are placed on your compute nodes. This could be achieved this way:
- Enquire about the name of the node the current process runs on, via MPI_Get_processor_name(), gethostname() or any other means you find adequate. MPI_Get_processor_name() being part of the MPI standard, I would recommend it for portability reasons.
- Collect the values through an MPI_Allgather() so that each process knows every other process's node name.
- For the process of smallest rank on each node, do whatever you need to report the power measurement.
This could look like this:
#include <mpi.h>
#include <iostream>
#include <cstring>
#include <vector>

// Returns true for the smallest-ranked process running on the current node
bool amIFirstOnNode() {
    int rank, size;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    // One MPI_MAX_PROCESSOR_NAME-sized slot per process, zero-initialized
    std::vector<char> names( size * MPI_MAX_PROCESSOR_NAME );
    int len;
    MPI_Get_processor_name( &names[rank * MPI_MAX_PROCESSOR_NAME], &len );

    // Gather all node names so that every process knows everybody's node
    MPI_Allgather( MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   names.data(), MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD );

    // Find the lowest rank whose node name matches ours
    int lower = 0;
    while ( std::strncmp( &names[rank * MPI_MAX_PROCESSOR_NAME],
                          &names[lower * MPI_MAX_PROCESSOR_NAME],
                          MPI_MAX_PROCESSOR_NAME ) != 0 ) {
        lower++;
    }
    return lower == rank;
}

int main( int argc, char *argv[] ) {
    MPI_Init( &argc, &argv );

    int rank;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    bool nodeMaster = amIFirstOnNode();
    if ( nodeMaster ) {
        std::cout << "[" << rank << "] master process on the node\n";
    }
    else {
        std::cout << "[" << rank << "] not master process on the node\n";
    }

    MPI_Finalize();
    return 0;
}
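Assuming your MPI installation provides the usual mpicxx wrapper, this could be compiled and run as (the source file name here is arbitrary):

mpicxx -o node_master node_master.cpp
mpirun -n 256 ./node_master

Each node should then print exactly one "master process" line.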
Regarding the use of numactl across nodes, again this is doable, but highly dependent on your environment. For example, to run on my own environment of dual-socket nodes, with one MPI process per socket / NUMA node, I sometimes use this numa_bind.sh script of mine:
#!/bin/bash
# Number of processes per node (here one per socket)
PPN=2
# Bind this process (identified by its rank in PMI_ID) to one socket and its local memory
numactl --cpunodebind=$(( $PMI_ID % $PPN )) --membind=$(( $PMI_ID % $PPN )) "$@"
which is called this way:
mpirun -ppn 2 numa_bind.sh my_mpi_binary [my_mpi_binary options]
But of course, this supposes that your environment sets PMI_ID or an equivalent...
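If PMI_ID is not available, most MPI implementations export something equivalent; for example, Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK, so a sketch of the same script adapted to it could look like:

#!/bin/bash
PPN=2
# OMPI_COMM_WORLD_LOCAL_RANK is the rank of this process within its own node (Open MPI)
numactl --cpunodebind=$(( $OMPI_COMM_WORLD_LOCAL_RANK % $PPN )) --membind=$(( $OMPI_COMM_WORLD_LOCAL_RANK % $PPN )) "$@"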