I am new to MPI and I am trying to write an implementation of Fox's algorithm (AxB=C, where A and B are n x n matrices). My program works fine, but I would like to see if I can speed it up by overlapping the communication during the shifting of the blocks of matrix B (the blocks are shifted cyclically upward in the algorithm) with the computation of the block products. Each process in the 2D Cartesian grid holds one block of each of the matrices A, B and C, as the algorithm requires. What I currently have is this, inside the stage loop of Fox's algorithm:
if (stage > 0) {
    /* broadcast this stage's block of A along the process row */
    MPI_Bcast(a_temp, n_local*n_local, MPI_DOUBLE, (rowID + stage) % q, row_comm);

    /* shift the B blocks cyclically upward; receive into b_temp, because a
       buffer with a pending MPI_Isend must not also be used as a receive buffer */
    MPI_Isend(b, n_local*n_local, MPI_DOUBLE, nbrs[UP], 111, grid_comm, &my_request1);
    MPI_Irecv(b_temp, n_local*n_local, MPI_DOUBLE, nbrs[DOWN], 111, grid_comm, &my_request2);
    MPI_Wait(&my_request1, &status);
    MPI_Wait(&my_request2, &status);

    double *tmp = b; b = b_temp; b_temp = tmp;   /* b now holds the received block */
    multiplyMatrix(a_temp, b, c, n_local);
}
The submatrices a_temp, b and b_temp are pointers of type double that each point to a block of n/numprocesses * n/numprocesses doubles (the size of one block matrix), e.g. b = (double *) calloc(n/numprocesses * n/numprocesses, sizeof(double)).
I would like to call the multiplyMatrix function before the MPI_Wait calls, so that the communication and the computation actually overlap, but I am not sure how to restructure the stage loop to do that. Do I need two separate buffers that I alternate between at different stages? (A rough sketch of what I mean is at the end of this post.)
(I know I could use MPI_Sendrecv_replace, but that does not help with overlapping, since it is a blocking call; the same is true for MPI_Sendrecv.)
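Here is a rough sketch of the restructuring I have in mind, with the multiply moved between starting the communication and the waits. The second buffer b_next and the exact loop layout are my own guesses, not tested code; as far as I understand, reading a buffer that has a pending MPI_Isend on it is permitted since MPI-3, but I am not sure this is the right way to alternate the buffers:

for (stage = 0; stage < q; stage++) {
    /* broadcast this stage's block of A along the process row */
    MPI_Bcast(a_temp, n_local*n_local, MPI_DOUBLE, (rowID + stage) % q, row_comm);

    if (stage < q - 1) {
        /* start shifting b for the next stage; receive into the spare buffer */
        MPI_Isend(b, n_local*n_local, MPI_DOUBLE, nbrs[UP], 111, grid_comm, &my_request1);
        MPI_Irecv(b_next, n_local*n_local, MPI_DOUBLE, nbrs[DOWN], 111, grid_comm, &my_request2);
    }

    /* compute with the block we already hold, so the computation overlaps
       the shift; b is only read here while the MPI_Isend on it is in flight */
    multiplyMatrix(a_temp, b, c, n_local);

    if (stage < q - 1) {
        MPI_Wait(&my_request1, &status);
        MPI_Wait(&my_request2, &status);
        double *tmp = b; b = b_next; b_next = tmp;   /* alternate the buffers */
    }
}

Is this the correct way to do it, or is there a better pattern for this kind of overlap?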