
Suppose I have a master program, which is a single-rank MPI program that uses MPI_Comm_spawn to spawn 5 worker programs.

Now, if I execute my master using the following command

aprun -n 1 -N 1 master

The total number of ranks after spawning will be 6. But will all 6 ranks run on the same node? Is there any way I can distribute the 6 ranks among 3 nodes?

I want exactly one copy of the master process and 5 worker processes.


1 Answer


Cray MPI did not support MPI_Comm_spawn until recently, and its approach to managing resources for spawned MPI jobs is unique. A place-holder job is launched using aprun to manage the resources used to host spawned jobs, i.e., the cores/nodes that will host the spawned MPI ranks. The set of resources managed by the place-holder job is called a "rank pool", by analogy with a memory pool. Here's how you would set up and use a rank pool:

rankpool.c

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    /* Name this rank pool "all_nodes", which will be
     * used by MPI_Comm_spawn to identify it. */
    MPIX_Comm_rankpool(MPI_COMM_WORLD, "all_nodes", /* 60 seconds timeout */ 60);
    MPI_Finalize();
    return 0;
}

spawning_app.c

[ ... code goes here ... ]

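MPI_Info_create(&info);
/* key = "rankpool", value = "all_nodes": spawn onto the pool named above */
MPI_Info_set(info, "rankpool", "all_nodes");
/* Spawn the worker executable ("worker" is a placeholder name).
 * The argv passed here holds only the worker's own arguments,
 * NULL-terminated, without the program name. */
MPI_Comm_spawn("worker", argv, num_ranks,
               info, 0, comm, &child_comm,
               MPI_ERRCODES_IGNORE);
MPI_Info_free(&info);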

[ ... more code ... ]

If you want to distribute 6 ranks across three nodes, you can launch your rank pool using aprun -n 6 -N 2: -n requests 6 ranks in total and -N places 2 ranks on each node, so the pool spans 3 nodes.
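The arithmetic behind -n and -N can be sketched in a few lines of C. This is only an illustration: nodes_spanned and node_of_rank are hypothetical helpers, not part of aprun or MPI, and the sketch assumes the default placement in which each node is filled with N consecutive ranks before the next node is used.

```c
#include <assert.h>

/* Hypothetical helper: number of nodes a pool occupies when launched
 * with "aprun -n total_ranks -N ranks_per_node". */
int nodes_spanned(int total_ranks, int ranks_per_node)
{
    assert(total_ranks > 0 && ranks_per_node > 0);
    /* Round up: a partially filled last node still counts. */
    return (total_ranks + ranks_per_node - 1) / ranks_per_node;
}

/* Hypothetical helper: zero-based node index hosting a given rank,
 * assuming nodes are filled with consecutive ranks (an assumption
 * about the placement policy, not something queried from the system). */
int node_of_rank(int rank, int ranks_per_node)
{
    assert(rank >= 0 && ranks_per_node > 0);
    return rank / ranks_per_node;
}
```

For -n 6 -N 2, nodes_spanned(6, 2) gives 3, and ranks 0-1, 2-3 and 4-5 fall on the first, second and third node respectively.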

If you want a more specific layout for your spawned ranks, you can reorder the ranks in the communicator that you pass to MPIX_Comm_rankpool to obtain this effect. For example, if your master job spawns various child jobs each with 4 ranks, and you want the ranks for each child job spread evenly across nodes, you can reorder the ranks in MPI_COMM_WORLD from this:

            MPI_COMM_WORLD
            --------------
        node 1        node 2        node 3          node 4
ranks   0  1  2  3    4  5  6  7    8  9  10  11    12  13  14  15

to this:

            reordered_comm
            --------------
        node 1         node 2         node 3          node 4
ranks   0  4  8  12    1  5  9  13    2  6  10  14    3  7  11  15

MPIX_Comm_rankpool will attempt to assign a contiguous set of ranks to each child job, so child jobs will generally have one rank on each node.
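One way to produce the reordered layout shown above is to compute each rank's new position and use it as the key argument of MPI_Comm_split. The mapping itself is plain integer arithmetic, sketched here under stated assumptions: reordered_rank is a hypothetical helper (not part of MPI or Cray MPI), and the ranks in MPI_COMM_WORLD are assumed to be placed in consecutive blocks of ranks_per_node per node, as in the first diagram.

```c
#include <assert.h>

/* Hypothetical helper: maps a rank's position in MPI_COMM_WORLD to its
 * position in reordered_comm. Assumes block placement, i.e. node k
 * hosts world ranks k*ranks_per_node .. (k+1)*ranks_per_node - 1. */
int reordered_rank(int world_rank, int ranks_per_node, int num_nodes)
{
    assert(ranks_per_node > 0 && num_nodes > 0);
    assert(world_rank >= 0 && world_rank < ranks_per_node * num_nodes);

    int node  = world_rank / ranks_per_node;  /* node hosting this rank */
    int local = world_rank % ranks_per_node;  /* slot within that node */

    /* Slot `local` of every node forms one contiguous group of
     * num_nodes new ranks, with exactly one member per node. */
    return local * num_nodes + node;
}
```

With 4 nodes and 4 ranks per node, world ranks 0, 1, 2, 3 on node 1 map to 0, 4, 8, 12, matching the second diagram. Each rank would then call MPI_Comm_split(MPI_COMM_WORLD, 0, key, &reordered_comm) with key set to this value, and pass reordered_comm to MPIX_Comm_rankpool instead of MPI_COMM_WORLD.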

For more details on how this all works, see Cray's dynamic process management whitepaper.