I'm writing an application using MPI (mpi4py, actually). The application may spawn new processes using MPI_Comm_spawn() (collectively on all current processes), and some nodes in the parent group/communicator may send data to some nodes in the child group/communicator and vice versa. (Note that MPI_Comm_spawn() and data sending/receiving happen in different threads, both for functionality [there is other functionality not directly relevant to this question, so I haven't described it] and for performance.)
Because MPI_Comm_spawn() may be called several times and I expect all nodes to be able to communicate with each other, I currently plan to use MPI_Intercomm_merge() to merge the two groups (parent and child) into one intracommunicator, and then send data through the new intracommunicator (the next MPI_Comm_spawn() will then happen on the new intracommunicator).
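To make the plan concrete, here is a rough sketch of the parent side of one spawn-and-merge step. It assumes `comm` is an mpi4py intracommunicator; `spawn_and_merge`, `worker_script` and `n_new` are hypothetical names used only for illustration.

```python
import sys


def spawn_and_merge(comm, worker_script, n_new):
    """Parent side: spawn n_new workers and return a merged intracommunicator."""
    # Collective over comm: every current process has to make this call.
    inter = comm.Spawn(sys.executable, args=[worker_script],
                       maxprocs=n_new, root=0)
    # high=False orders the existing processes before the spawned ones.
    merged = inter.Merge(high=False)
    return merged


# Child side, inside the spawned script (assumes `from mpi4py import MPI`):
#     parent = MPI.Comm.Get_parent()
#     merged = parent.Merge(high=True)
```

The next spawn would then be called on the returned communicator instead of the original one.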
However, because the spawn and merge happen while the program is running, some data will already have been sent through the old communicator (but may not yet have been received by the destination). How can I safely switch from the old communicator to the new one (e.g. be able to free the old communicator[s] at some point) while losing as little performance as possible? MPI_Intercomm_merge() is the only way I know to guarantee that all processes can send data to each other (because if we don't merge, then after the next MPI_Comm_spawn() call some processes can't directly send data to each other), and I don't mind switching to another method as long as it works well.
For example, in the following chart, processes A, B and C are the initial processes (mpiexec -np 3), and D is a spawned process:
A and B will send continuous data to C; while they are sending, D is spawned; then C sends data to D. Suppose the old communicator that A, B and C use is comm1 and the merged intracommunicator is comm2.
What I want to achieve is to send data through comm1 initially, and have all processes switch to comm2 after D is spawned. What is missing is a mechanism for C to know when it can safely switch from comm1 to comm2 to receive data from A and/or B, so that I can then safely call MPI_Comm_free(comm1).
Simply sending a special tag through comm1 at the time of the switch would be my last resort, because C doesn't know how many processes will send data to it. It does know how many groups of processes will send data to it, so this could be achieved by introducing local leaders (but I'd like to know about other options).
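For reference, the end-marker idea I'm trying to avoid would look roughly like the sketch below. Here the marker is carried in the pickled payload rather than in the MPI tag, purely to keep the example short. `DATA`, `END`, `send_data`, `close_stream` and `drain_until_closed` are hypothetical names, `comm` is assumed to be an mpi4py communicator (comm1), and `n_senders` is assumed to be known, which is exactly the part I don't have. It also relies on MPI's non-overtaking rule for messages from one sender on one communicator, which as far as I know only holds when each sender issues its sends from a single thread.

```python
DATA, END = "data", "end"  # hypothetical message kinds, not part of MPI


def send_data(comm, dest, payload):
    # Normal traffic on the old communicator.
    comm.send((DATA, payload), dest=dest)


def close_stream(comm, dest):
    # Last message this sender puts on the old communicator for dest.
    comm.send((END, None), dest=dest)


def drain_until_closed(comm, n_senders, handle):
    """Receive on comm until n_senders end markers have arrived."""
    closed = 0
    delivered = 0
    while closed < n_senders:
        kind, payload = comm.recv()  # mpi4py defaults: any source, any tag
        if kind == END:
            closed += 1
        else:
            handle(payload)
            delivered += 1
    return delivered
```

Once `drain_until_closed` returns on C, nothing from those senders should still be pending on comm1, so C could start receiving on comm2.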
Because A, B and C are processing in parallel, and send/recv and spawn happen in different threads, we can't guarantee that there is no pending data when we call MPI_Comm_spawn(). E.g. if we imagine that A and B send and C receives at the same rate, then when they call comm_spawn, C has only received half of the data from A and B, so we can't drop comm1 at C yet, but have to wait until C has received all pending data from comm1 (which is an unknown number of messages).
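One option I've considered for turning that unknown number into a known one is to count messages: every process tracks how many messages it has sent to each rank on comm1, and at switch time the counts are exchanged collectively. The sketch below is only an assumption about how this could look. `drain_by_count`, `sent_counts`, `already_received` and `handle` are hypothetical names, `comm` is assumed to be an mpi4py intracommunicator (comm1), each sender must have stopped sending on `comm` before the exchange, and calling a collective alongside point-to-point traffic from other threads would presumably need MPI_THREAD_MULTIPLE.

```python
def drain_by_count(comm, sent_counts, already_received, handle):
    """Receive on comm until every message addressed to this rank has arrived.

    sent_counts[r] is the number of messages this rank sent to rank r on comm.
    already_received is how many messages this rank has received on comm so far.
    """
    # Collective exchange: entry i of the result is what rank i sent to us.
    from_each = comm.alltoall(sent_counts)
    expected = sum(from_each)
    received = already_received
    while received < expected:
        handle(comm.recv())  # mpi4py defaults: any source, any tag
        received += 1
    return expected
```

After this returns on every process, no data messages should be left in flight on comm1. It costs one extra collective per switch, which is the performance trade-off I'm unsure about.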
Are there any mechanisms provided by MPI or mpi4py (e.g. error codes or exceptions) to achieve this?
By the way, if my approach is clearly bad, or if I misunderstand what MPI_Comm_free() does, please point it out.
(My understanding is that MPI_Comm_free() is not a collective call; after calling MPI_Comm_free(comm1), no more send/recv calls on comm1 are allowed on the node that called MPI_Comm_free(comm1).)