
To optimize MPI communication it is important to understand the flow of the whole communication process. This is rather straightforward for synchronous communication, but what about asynchronous communication? As I understand it, it works in one of these two ways:

  1. Rank0 -> Isend -> Rank1 and Rank1 -> Isend -> Rank0
  2. Rank0 -> Irecv -> Rank1 and Rank1 -> Irecv -> Rank0
  3. Rank0 and Rank1 do some computation
  4. Messages are being dispatched to their respective target location
  5. Matching Recv call found! -> write into the given recv-buffer
  6. Rank0 and Rank1 finish their computation and call MPI_Wait for send and receive
  7. MPI_Wait -> communication completed

or

  1. Rank0 -> Isend -> Rank1 and Rank1 -> Isend -> Rank0
  2. Rank0 and Rank1 do some computation
  3. Messages are being dispatched to their respective target location
  4. No matching Recv call found! -> the MPI library allocates its own temporary buffer and writes into that
  5. Rank0 and Rank1 finish their computation and call MPI_Recv
  6. Matching MPI_Recv call is found -> temporary buffer is written into the recv-buffer
  7. Rank0 and Rank1 call MPI_Wait on the send request
  8. MPI_Wait -> communication is completed -> the temporary buffer is freed

Is this correct? Do I need to be aware of any other processes that run in the background of MPI to optimize its usage?


1 Answer


In general, when sending data with MPI, you should always pre-post your receives if possible. That means if you're trying to do communication between two processes, you should do something like this (lots of important arguments left out for brevity):

if (rank == 0) {
  MPI_Irecv(rdata, ..., 1, ..., &req[0]);
  ...
  MPI_Isend(sdata, ..., 1, ..., &req[1]);
} else {
  MPI_Irecv(rdata, ..., 0, ..., &req[0]);
  ...
  MPI_Isend(sdata, ..., 0, ..., &req[1]);
}
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

You can do other work between the Irecv and the Isend if you like, but by pre-posting the receives you save memory and time, because the user buffer is already available for the MPI library to deposit incoming data into. If you don't post the receive first and a message arrives before you call Irecv (or any other flavor of receive), the message has to be stored in an internal buffer until the receive is posted, and then copied a second time from that MPI buffer into the user buffer. It can also mean the message isn't sent at all until the receive is posted, if the message is too large to fit in the library's pre-allocated buffers.

You can call the Irecv as early as you like too. If you want to put the Irecv at the beginning of an iteration, do a bunch of calculation, then call Isend at the end of the iteration when the data is ready, that's fine too.

Cross-talk from other processes is usually not a problem unless you have many processes all sending messages to a single process. In that case you can run into flow-control issues, but that situation doesn't usually come up, and when it does, a collective is normally a better fit than point-to-point communication.