MPI Fox's algorithm non blocking send and receive

Question

I am new to MPI and I am trying to write an implementation of Fox's algorithm (AxB=C where A and B are matrices of dimension nxn). My program works fine but I would like to see if I can speed it up specifically by overlapping the communication during the shifting of the the blocks in matrix B with the computation of the product matrices (the block matrices of B are shifted cyclically up in the algorithm). Each process in the 2D Cartesian grid has a block from matrices A, B and C as per the algorithm. What I currently have is this, which is inside Fox's algorithm

if (stage > 0){  


   //shifting b values in all proccess

    MPI_Bcast(a_temp, n_local*n_local, MPI_DOUBLE, (rowID + stage) % q , row_comm);
    MPI_Isend(b, n_local*n_local, MPI_DOUBLE, nbrs[UP], 111, grid_comm,&my_request1);   
    MPI_Irecv(b, n_local*n_local, MPI_DOUBLE, nbrs[DOWN], 111, grid_comm,&my_request2);                         
    MPI_Wait(&my_request1, &status);
    MPI_Wait(&my_request2, &status);
    multiplyMatrix(a_temp,b,c,n_local);
}

The submatrices a_temp, b, b_temp are pointers of type double that point to chunks n/numprocess*n/numprocesses (this is the size of the block matrices e.g. b = (double *) calloc(n/numprocess*n/numprocesses, sizeof(double)) ).

I would like to have the multiplyMatrix function before the MPI_Wait calls (that would constitute the overlapping of communication and computation) but I am not sure how to do that. Do I need to have two separate buffers and alternate between them at different stages?

(I know I can use MPI_Sendrecv_replace but that does not help with overlapping since it uses blocking send and receive. The same is true for MPI_Sendrecv)

YsK YsK · Accepted Answer · 2013-03-19T17:43:06

I actually figured out how to do this. This question should probably be removed. But since I am new to MPI I will post these solutions here and if anyone has suggestions for improvements I would be happy if they share them. Method 1:

// Fox's algorithm
 double * b_buffers[2];
 b_buffers[0] = (double *) malloc(n_local*n_local*sizeof(double));
 b_buffers[1] = b;
 for (stage =0;stage < q; stage++){
       //copying a into a_temp and Broadcasting a_temp of each proccess to all other proccess in its row
        for (i=0;i< n_local*n_local; i++)
            a_temp[i]=a[i];
        if (stage == 0) {
           MPI_Bcast(a_temp, n_local*n_local, MPI_DOUBLE, (rowID + stage) % q , row_comm);
           multiplyMatrix(a_temp,b,c,n_local);
           MPI_Isend(b, n_local*n_local, MPI_DOUBLE, nbrs[UP], 111, grid_comm,&my_request1);    
           MPI_Irecv(b, n_local*n_local, MPI_DOUBLE, nbrs[DOWN], 111, grid_comm,&my_request2);
           MPI_Wait(&my_request2, &status);
           MPI_Wait(&my_request1, &status);
      }


       if (stage > 0)
       {        
           //shifting b values in all procces
            MPI_Bcast(a_temp, n_local*n_local, MPI_DOUBLE, (rowID + stage) % q , row_comm);
            MPI_Isend(b_buffers[(stage)%2], n_local*n_local, MPI_DOUBLE, nbrs[UP], 111, grid_comm,&my_request1);    
            MPI_Irecv(b_buffers[(stage+1)%2], n_local*n_local, MPI_DOUBLE, nbrs[DOWN], 111, grid_comm,&my_request2);
                multiplyMatrix(a_temp, b_buffers[(stage)%2], c, n_local);           
            MPI_Wait(&my_request2, &status);
            MPI_Wait(&my_request1, &status);

     }      
}

Method 2:

// Fox's algorithm

 for (stage =0;stage < q; stage++){
       //copying a into a_temp and Broadcasting a_temp of each proccess to all other proccess in its row
        for (i=0;i< n_local*n_local; i++)
            a_temp[i]=a[i];
        if (stage == 0) {
           MPI_Bcast(a_temp, n_local*n_local, MPI_DOUBLE, (rowID + stage) % q , row_comm);
           multiplyMatrix(a_temp,b,c,n_local);
           MPI_Isend(b, n_local*n_local, MPI_DOUBLE, nbrs[UP], 111, grid_comm,&my_request1);    
           MPI_Irecv(b, n_local*n_local, MPI_DOUBLE, nbrs[DOWN], 111, grid_comm,&my_request2);
           MPI_Wait(&my_request2, &status);
           MPI_Wait(&my_request1, &status);
      }


       if (stage > 0)
       {        
           //shifting b values in all proccess
            memcpy(b_temp, b, n_local*n_local*sizeof(double));
                MPI_Bcast(a_temp, n_local*n_local, MPI_DOUBLE, (rowID + stage) % q , row_comm);
            MPI_Isend(b, n_local*n_local, MPI_DOUBLE, nbrs[UP], 111, grid_comm,&my_request1);   
                MPI_Irecv(b, n_local*n_local, MPI_DOUBLE, nbrs[DOWN], 111, grid_comm,&my_request2);
                multiplyMatrix(a_temp, b_temp, c, n_local);         
               MPI_Wait(&my_request2, &status);
                MPI_Wait(&my_request1, &status);

     }

Both of these seem to work, but as I said I am new to MPI and if you have any comments or suggestions please share.

MPI Fox's algorithm non blocking send and receive

1 Answers