I'm trying to use MPI to multiply two n x n matrices. The second matrix (bb) is broadcast to all "slaves", and each slave is then sent a row of the first matrix (aa) so it can compute one row of the product. The slave sends its answer back to the master process, which stores it in the product matrix (cc1). For some reason I'm getting the error:
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
I believe the master process is receiving all the messages sent by the slave processes and vice versa, so I'm not sure what is going on here (from what I understand, exit code 11 means one of the processes segfaulted)... any ideas?
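For anyone trying this out, an MPI build/run invocation along these lines is what's assumed here (the file name, process count, and matrix size are just placeholders):

mpicc main.c -o matmul
mpiexec -n 4 ./matmul 1000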
Main:
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/times.h>
#define min(x, y) ((x)<(y)?(x):(y))
#define MASTER 0
double* gen_matrix(int n, int m);
int mmult(double *c, double *a, int aRows, int aCols, double *b, int bRows, int bCols);
int main(int argc, char* argv[]) {
    int nrows, ncols;
    double *aa; /* the A matrix */
    double *bb; /* the B matrix */
    double *cc1; /* A x B computed */
    double *buffer; /* Row to send to slave for processing */
    double *ans; /* Computed answer for master */
    int myid, numprocs;
    int i, j, numsent, sender;
    int row, anstype;
    double starttime, endtime;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (argc > 1) {
        nrows = atoi(argv[1]);
        ncols = nrows;
        if (myid == 0) {
            /* Master Code */
            aa = gen_matrix(nrows, ncols);
            bb = gen_matrix(ncols, nrows);
            cc1 = malloc(sizeof(double) * nrows * nrows);
            starttime = MPI_Wtime();
            buffer = (double*)malloc(sizeof(double) * ncols);
            numsent = 0;
            MPI_Bcast(bb, ncols*nrows, MPI_DOUBLE, MASTER, MPI_COMM_WORLD); /*broadcast bb to all slaves*/
            for (i = 0; i < min(numprocs-1, nrows); i++) { /*for each process or row*/
                for (j = 0; j < ncols; j++) { /*for each column*/
                    buffer[j] = aa[i * ncols + j]; /*get row of aa*/
                }
                MPI_Send(buffer, ncols, MPI_DOUBLE, i+1, i+1, MPI_COMM_WORLD); /*send row to slave*/
                numsent++; /*increment number of rows sent*/
            }
            ans = (double*)malloc(sizeof(double) * ncols);
            for (i = 0; i < nrows; i++) {
                /* collect one computed row from whichever slave answers first */
                MPI_Recv(ans, ncols, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                sender = status.MPI_SOURCE;
                anstype = status.MPI_TAG; /* the tag encodes which row this answer belongs to */
                for (i = 0; i < ncols; i++){
                    cc1[(anstype-1) * ncols + i] = ans[i];
                }
                if (numsent < nrows) {
                    /* rows remain: hand the sender its next row of aa */
                    for (j = 0; j < ncols; j++) {
                        buffer[j] = aa[numsent*ncols + j];
                    }
                    MPI_Send(buffer, ncols, MPI_DOUBLE, sender, numsent+1,
                             MPI_COMM_WORLD);
                    numsent++;
                } else {
                    MPI_Send(MPI_BOTTOM, 0, MPI_DOUBLE, sender, 0, MPI_COMM_WORLD); /* tag 0 tells the slave to stop */
                }
            }
            endtime = MPI_Wtime();
            printf("%f\n",(endtime - starttime));
        } else {
            /* Slave Code */
            buffer = (double*)malloc(sizeof(double) * ncols);
            bb = (double*)malloc(sizeof(double) * ncols*nrows);
            MPI_Bcast(bb, ncols*nrows, MPI_DOUBLE, MASTER, MPI_COMM_WORLD); /*get bb*/
            if (myid <= nrows) {
                while(1) {
                    MPI_Recv(buffer, ncols, MPI_DOUBLE, MASTER, MPI_ANY_TAG, MPI_COMM_WORLD, &status); /*receive a row of aa*/
                    if (status.MPI_TAG == 0){
                        break;
                    }
                    row = status.MPI_TAG; /*get row number*/
                    ans = (double*)malloc(sizeof(double) * ncols);
                    for (i = 0; i < ncols; i++){
                        ans[i]=0.0;
                    }
                    for (i=0; i<nrows; i++){
                        for (j = 0; j < ncols; j++) { /*for each column*/
                            ans[i] += buffer[j] * bb[j * ncols + i];
                        }
                    }
                    MPI_Send(ans, ncols, MPI_DOUBLE, MASTER, row, MPI_COMM_WORLD);
                }
            }
        } /*end slave code*/
    } else {
        fprintf(stderr, "Usage matrix_times_vector <size>\n");
    }
    MPI_Finalize();
    return 0;
}
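gen_matrix isn't shown above; judging from how it's used, it only needs to return a freshly malloc'd n x m block of doubles. A stand-in along these lines (the random fill is an assumption, the actual values shouldn't matter) is enough to make the listing compile:

double* gen_matrix(int n, int m) {
    int i;
    double *a = (double*)malloc(sizeof(double) * n * m);
    for (i = 0; i < n * m; i++) {
        a[i] = (double)rand() / RAND_MAX; /* rand/RAND_MAX come from stdlib.h, already included */
    }
    return a;
}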