Until recently, Cray MPI did not support MPI_Comm_spawn, and its solution for managing the resources of spawned MPI jobs is unique. A placeholder job is launched with aprun to manage the resources that will host spawned jobs, i.e., the cores/nodes on which the spawned MPI ranks will run. The set of resources managed by the placeholder job is called a "rank pool", in analogy to a memory pool. Here's how you would set up and use a rank pool:
rankpool.c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Serve this job's ranks as a rank pool named "all_nodes";
       the final argument is a timeout (see Cray's whitepaper). */
    MPIX_Comm_rankpool(MPI_COMM_WORLD, "all_nodes", 60);

    MPI_Finalize();
    return 0;
}
spawning_app.c
[ ... code goes here ... ]

MPI_Info info;
MPI_Comm child_comm;

/* Ask spawn to draw its resources from the "all_nodes" rank pool. */
MPI_Info_create(&info);
MPI_Info_set(info, "rankpool", "all_nodes");

MPI_Comm_spawn("master", argv, num_ranks,
               info, 0, comm, &child_comm,
               MPI_ERRCODES_IGNORE);

[ ... more code ... ]
If you want to distribute 6 ranks across three nodes, you can launch your rank pool with aprun -n 6 -N 2, which gives you 6 total ranks with 2 ranks per node.
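Putting the two programs together, a launch might look like the following sketch (the executable names are placeholders; the aprun options are the ones from the example above):

```shell
# Host the rank pool: 6 ranks total, 2 per node, so 3 nodes.
aprun -n 6 -N 2 ./rankpool &

# Launch the master job, which spawns children into the pool.
aprun -n 1 ./spawning_app
```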
If you want a more specific layout for your spawned ranks, you can reorder the ranks in the communicator that you pass to MPIX_Comm_rankpool. For example, if your master job spawns several child jobs of 4 ranks each, and you want each child job's ranks spread evenly across nodes, you can reorder the ranks in MPI_COMM_WORLD from this:
MPI_COMM_WORLD
--------------
        node 1      node 2      node 3       node 4
ranks   0 1 2 3     4 5 6 7     8 9 10 11    12 13 14 15
to this:
reordered_comm
--------------
        node 1      node 2      node 3       node 4
ranks   0 4 8 12    1 5 9 13    2 6 10 14    3 7 11 15
MPIX_Comm_rankpool will attempt to assign a contiguous set of ranks to each child job, so with this ordering each child job will generally get one rank on each node.
For more details on how this all works, see Cray's dynamic process management whitepaper.