
The following code creates an [m][n] matrix with the double-pointer malloc approach and sends an equal share of its rows to each of the P-1 other processors using non-blocking MPI functions. Processor P=0 is responsible for generating the matrix and sending the chunks, so that every processor with rank != 0 receives a set of rows and processes them.

The code does not work, even though I have spent days making sure every line is correct, and I can't figure out where the bugs come from :( I appreciate any help.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include "mpi.h"

int main (int argc, char* argv[]) {

    const int RANK_0 = 0; // Rank 0
    const int ROWS = 24; // Row size
    const int COLS = 12; // Column size
    const int TAG_0 = 0; // Message ID 
    const int TAG_1 = 1; // Message ID
    int rank; // The process ID 
    int P; // Number of Processors 

    /* MPI Initialisation */
    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank); 
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* Each client processor receives ROWS/P set of arrays */
    if(rank != RANK_0){

        int i,j;
        int chunkSize = ROWS/P;

        MPI_Request *req[chunkSize]; // Requests
        MPI_Request *req1[chunkSize]; // Requests
        MPI_Status status[chunkSize];
        int ptr[chunkSize];

        int **buffRecv = malloc(chunkSize * sizeof(int *));

        for (i = 0; i < chunkSize; i++) {
            buffRecv[i] = malloc(COLS * sizeof(int));

            MPI_Irecv(&ptr[i], 1, MPI_INT, RANK_0, TAG_1, MPI_COMM_WORLD, req1[i]);
            MPI_Irecv(buffRecv[i], COLS, MPI_INT, RANK_0, TAG_0, MPI_COMM_WORLD, req[i]);
            MPI_Wait(req1[i], MPI_STATUSES_IGNORE);
            MPI_Wait(req[i], MPI_STATUSES_IGNORE);  
        }

        printf("\n ===> Processor %d has recieved his set of rows, now start calculation: \n", rank);

        for(i = 0; i < chunkSize; i++){
          // print arrays row by row or do something

        }

        printf("\n Rank %d has done its tasks \n", rank);   


    } 
    else 
    {
        /* MASTER PROCESS*/

        int n=0;
        int k,i,j,dest,offset;
        int inc=1;
        MPI_Request *req[ROWS]; // Requests
        MPI_Request *req1[ROWS]; // Requests
        int chunkSize= ROWS/P;

        int **buf= malloc(ROWS * sizeof(int *));

        offset = chunkSize;
        for(dest = P; dest >= 0; dest--){

            // ROWS/P rows to each destination
            for (i = n; i < offset; i++)
            {
                buf[i] = malloc(COLS * sizeof(int));

                for (j = 0; j < COLS; j++)
                {
                    buf[i][j]=1;
                }

                if(dest == 0)
                {

                   // rank_0 chunk will be handled here
                }

                else
                {
                    MPI_Isend(&i, 1, MPI_INT, dest, TAG_1, MPI_COMM_WORLD, req1[i]); 
                    MPI_Isend(buf[i], COLS, MPI_INT, dest, TAG_0, MPI_COMM_WORLD, req[i]);
                }

             }

            // Print the result after each ROWS/P rows is sent
             if(dest != 0){
                 printf("Row[%d] to Row[%d] is sent to rank# %d\n", n, k, dest);
             } 

            n=offset;
            offset= offset + chunkSize;

        }
    } 

    MPI_Finalize();
}

1 Answer


There are many issues in this code, which I'll try to enumerate later. But the most important one, I believe, is that the sending requests are never waited for, and are re-used from one destination to the next. This is very wrong, and since there is no testing or waiting point anywhere, the sends may well never actually happen. I'll leave you with that for now and edit my answer progressively.

Edit: Ok, now let's progress step by step:

  1. The memory management: since you plan to distribute chunks of data to your processes, it is better to maximise the size of each transfer, and therefore to minimise the number of transfers. But to transfer several rows of your matrix in one go, the data needs to be stored contiguously in memory. To achieve that while keeping the [i][j] double-bracket access simplicity, you need to: first, allocate the whole storage for your data; and second, allocate a pointer of pointers into this data, making each entry point to the first element of each row... It will look like this:

    int **matrix = malloc( ROWS * sizeof( int* ) );    // one pointer per row
    matrix[0] = malloc( COLS * ROWS * sizeof( int ) ); // a single contiguous block for the whole matrix
    for ( int i = 1; i < ROWS; i++ ) {
        matrix[i] = matrix[i-1] + COLS;                // each row starts COLS ints after the previous one
    }
    

    This is far from being the main issue but that's a good trick for another time.

  2. The request issue: as already mentioned, your sending requests are never waited for, and that is wrong. No MPI transaction is complete until you have either waited for it with MPI_Wait() or MPI_Waitall(), or tested it successfully with one of the MPI_Testxxx() functions. The simplest fix here is to use MPI_Waitall() (see the first sketch after this list).

  3. What about process #0? It sends to itself, but never will it receive what was sent...

  4. I didn't check the chunk sizes and offsets, but I'm pretty sure that if the number of processes doesn't divide the number of rows, you'll be in trouble; the second sketch after this list handles that case.

  5. Finally (hopefully), what you tried to do here corresponds very much to an MPI_Scatter(), or possibly an MPI_Scatterv(). Now that your memory is stored linearly, have a look at them, and that should just solve your problem (a sketch follows below).
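
For point 2, here is a minimal, self-contained sketch of the send side done right. It is an illustration with names of my own choosing (rows, reqs), not a drop-in patch for your code: each MPI_Request is a plain object passed by address, and one MPI_Waitall() completes all the sends before the buffer is released.

    /* Sketch only: rows and reqs are illustrative names, not from the
       original code. One row of data and one request per destination. */
    #include <stdio.h>
    #include <stdlib.h>
    #include "mpi.h"

    int main(int argc, char* argv[]) {
        const int COLS = 12;
        const int TAG_0 = 0;
        int rank, P;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        if (rank == 0) {
            int *rows = malloc((P - 1) * COLS * sizeof(int));
            MPI_Request *reqs = malloc((P - 1) * sizeof(MPI_Request));

            for (int dest = 1; dest < P; dest++) {
                for (int j = 0; j < COLS; j++) {
                    rows[(dest - 1) * COLS + j] = dest; /* fill with something */
                }
                /* The request is a plain object passed by address,
                   not an uninitialised pointer */
                MPI_Isend(&rows[(dest - 1) * COLS], COLS, MPI_INT, dest, TAG_0,
                          MPI_COMM_WORLD, &reqs[dest - 1]);
            }

            /* No send is guaranteed to complete before this call, so rows[]
               must not be freed or rewritten until it returns */
            MPI_Waitall(P - 1, reqs, MPI_STATUSES_IGNORE);

            free(reqs);
            free(rows);
        } else {
            int row[COLS];
            MPI_Recv(row, COLS, MPI_INT, 0, TAG_0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank %d received its row (first value: %d)\n", rank, row[0]);
        }

        MPI_Finalize();
        return 0;
    }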
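
And for points 3, 4 and 5 together, a sketch of what the MPI_Scatterv() version could look like once the matrix is stored contiguously as in point 1. The counts/displs computation is my own illustration: it gives the first ROWS % P ranks one extra row, so a ROWS that is not a multiple of P works, and rank 0 receives its own chunk like everybody else.

    /* Sketch only: counts, displs and myRows are illustrative names. */
    #include <stdio.h>
    #include <stdlib.h>
    #include "mpi.h"

    int main(int argc, char* argv[]) {
        const int ROWS = 25;   /* deliberately not a multiple of P */
        const int COLS = 12;
        int rank, P;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        /* Every rank computes how many ints each rank gets and where they start */
        int *counts = malloc(P * sizeof(int));
        int *displs = malloc(P * sizeof(int));
        int offset = 0;
        for (int r = 0; r < P; r++) {
            counts[r] = (ROWS / P + (r < ROWS % P ? 1 : 0)) * COLS;
            displs[r] = offset;
            offset += counts[r];
        }

        /* Contiguous allocation from point 1, on the root only */
        int **matrix = NULL;
        if (rank == 0) {
            matrix = malloc(ROWS * sizeof(int*));
            matrix[0] = malloc(ROWS * COLS * sizeof(int));
            for (int i = 1; i < ROWS; i++) {
                matrix[i] = matrix[i-1] + COLS;
            }
            for (int i = 0; i < ROWS; i++) {
                for (int j = 0; j < COLS; j++) {
                    matrix[i][j] = 1;
                }
            }
        }

        /* One collective call replaces all the Isend/Irecv pairs, and
           rank 0 receives its own chunk as well (point 3) */
        int *myRows = malloc(counts[rank] * sizeof(int));
        MPI_Scatterv(rank == 0 ? matrix[0] : NULL, counts, displs, MPI_INT,
                     myRows, counts[rank], MPI_INT, 0, MPI_COMM_WORLD);

        printf("Rank %d owns %d rows\n", rank, counts[rank] / COLS);

        free(myRows);
        free(counts);
        free(displs);
        if (rank == 0) {
            free(matrix[0]);
            free(matrix);
        }
        MPI_Finalize();
        return 0;
    }

Note that counts and displs are computed redundantly on every rank, so no extra messages are needed to tell the workers how much data to expect.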

Hope this helps.