How to safely switch to another communicator in MPI

Question

I'm writing an application using MPI (mpi4py actually). The application may spawn some new processes using MPI_Comm_spawn() (collectively on all current processes) and some nodes from the parent group/communicator may send data to some nodes in the child group/communicator and vice versa. (Notice MPI_Comm_spawn() and data sending/receving are happening in different threads both for functionality [there are other functionalities not directly relevant to this question so I didn't describe] and performance.)

Because the MPI_Comm_spawn() function may be called for several times and I expect all nodes can communicate with each other, I currently plan to use MPI_Intercomm_merge() to merge the two groups (parent and child) into one intracommunicator, and then send data through the new intracommunicator (and the next MPI_Comm_spawn() will happen on the new intracommunicator).

However, because the spawn and merge process happens during the program running, there will be some data sent through the old communicator already (but may not have yet been received by the dest). How could I safely switch from the old communicator to the new communicator (e.g. be able to delete the old communicator[s] at some point) while losing the least performance? The MPI_Comm_merge() is the only way I know to guarentee all processes can send data to each other (because if we don't merge, the next time we call MPI_Comm_merge(), some processes can't directly send data to each other), and I don't mind to change it to another method as long as it works well.

For example, in the following chart, process A, B, C are initial processes (mpiexec -np 3), D is a spawned process:

A and B will send continous data to C; during the sending time, D is spawned; then C sends data to D. Suppose the old communicator A, B and C uses is comm1 and the merged intracommunicator is comm2.

What I want to achieve is to send data through comm1 initially, and (all processes) switch to comm2 after D is spawned. What lacks is a mechanism to know when can C safely switch from comm1 to comm2 to receive data from A and/or B, and then I can safely call MPI_Comm_free(comm1). Simply sending a special tag through comm1 at the time of switch would be the last option because C don't know how many processes will send data to it. It does know how many groups of processes will send data to it, so this can be achieved by introducing local leaders (but I'd like to know about other options).
Because A, B and C are processing in parellel and send/recv and spawn are happening in different threads, we can't guarentee no pending data when we call MPI_Comm_spawn(). E.g. if we imagine A and B process send and C processes recv at a same rate, when they call comm_spawn, C has only received half of the data from A and B, so we can't drop comm1 at C yet, but have to wait until C has received all pending data from comm1 (which is an unknown number of messages).

Are there any mechanisms provided by MPI or mpi4py (e.g. error codes or exceptions) to achieve this?

By the way, if my approach is apparently bad or if I misunderstand what MPI_Comm_free() does, please point out.
(What I understand is that MPI_Comm_free() is not a collective call; after calling MPI_Comm_free(comm1), no more send/recv calls to comm1 is allowed on the same node which calls MPI_Comm_free(comm1))

Gilles Gouaillardet Gilles Gouaillardet · Accepted Answer · 2017-06-27T11:45:08

so basically, C invokes MPI_Comm_spawn(..., MPI_COMM_SELF, ...)

why don't you have {A,B,C} invoke MPI_Comm_spawn(..., comm1, ...) instead ?

MPI_Intercomm_merge() is a collective operation, so you need to "synchronize" your tasks somehow, so why not "synchronize" them before MPI_Comm_spawn() instead ?

then switching to the new communicator is trivial

How to safely switch to another communicator in MPI

1 Answers