
I'm trying to write a simple matrix multiplication program with MPI, using 1, 2, 4, or 8 processors. My code works for 1 processor (in that case it just does a plain matrix multiplication; there's no work to split up when running on a single rank anyway). It also works on 2 and 4 processors. However, when I try 8 processors (that is, -n 8 on the command line when running the program), I don't get the correct value everywhere in the matrix c.

Here are examples: if SIZE = 8 (that is, a, b, and c are all 8x8 matrices), the resulting matrix is as follows:

   8.00   8.00   8.00   8.00   8.00   8.00   8.00   8.00
   8.00   8.00   8.00   8.00   8.00   8.00   8.00   8.00
   8.00   8.00   8.00   8.00   8.00   8.00   8.00   8.00
   8.00   8.00   8.00   8.00   8.00   8.00   8.00   8.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
   0.00   0.00  16.00  16.00  16.00  16.00  16.00  16.00

And if SIZE = 16:

  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00
  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00
  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00
  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00
  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00
  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00
   0.00   0.00   0.00   0.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00
  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00

As you can see, zeroes pop up in the lower left. Something that rank 7 is doing results in those entries becoming 0.

I've been staring at my code for a long time now, and I feel like I just need another pair of eyes on it. As far as I can tell, all the sends and receives work properly, and all the tasks get the values they're supposed to. From the testing I've done, no task ever assigns a value of 0 to any element of the c matrix. I don't know why it happens, how, or what I can do to fix it.

Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define SIZE 16 /*assumption: SIZE a multiple of number of nodes*/
#define FROM_MASTER 1 /*setting a message type*/
#define FROM_WORKER 2 /*setting a message type*/
#define DEBUG 1 /*1 = debug on, 0 = debug off*/

MPI_Status status;

static double a[SIZE][SIZE];
static double b[SIZE][SIZE];
static double c[SIZE][SIZE];
static double b_to_trans[SIZE][SIZE];
static void init_matrix(void)
{
    int i, j;
    for (i = 0; i < SIZE; i++)
    {
        for (j = 0; j < SIZE; j++) {
            a[i][j] = 1.0;
            if(i >= SIZE/2) a[i][j] = 2.0;
            b_to_trans[i][j] = 1.0;
            if(j >= SIZE/2) b[i][j] = 2.0;
//          c[i][j] = 1.0;
        }
    }
}

static void print_matrix(void)
{
    int i, j;
    for(i = 0; i < SIZE; i++) {
        for(j = 0; j < SIZE; j++) {
            printf("%7.2f", c[i][j]);
        }
    printf("\n");
    }
}

static void transpose_matrix()
{
    int i, j;
    for(i = 0; i<SIZE; i++)
        for(j = 0; j<SIZE;j++)
            b[i][j] = b_to_trans[j][i];
}

int main(int argc, char **argv)
{
    int myrank, nproc;
    int rows; /*amount of work per node (rows per worker)*/
    int mtype; /*message type: send/recv between master and workers*/
    int dest, src, offseta, offsetb;
    int runthrough, runmod;
    double start_time, end_time;
    int i, j, k, l;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    rows = SIZE/nproc;
    mtype = FROM_MASTER;

    if (myrank == 0) {
        /*Initialization*/
        printf("SIZE = %d, number of nodes = %d\n", SIZE, nproc);
        init_matrix();
        transpose_matrix();
        start_time = MPI_Wtime();

        if(nproc == 1) { /*In case we only run on one processor, the master will simply do a regular matrix-matrix multiplication.*/
            for(i = 0; i < SIZE; i++) {
                for(j = 0; j < SIZE; j++) {
                    for(k = 0; k < SIZE; k++)
                        c[i][j] = c[i][j] + a[i][k]*b[j][k];
                }
            }
            end_time = MPI_Wtime();
            if(DEBUG) /*Prints the resulting matrix c*/
                print_matrix();
            printf("Execution time on %2d nodes: %f\n", nproc, end_time-start_time);
        }
        else {

            for(l = 0; l < nproc; l++){
                offsetb = rows*l;
                offseta = rows;
                mtype = FROM_MASTER;

                for(dest = 1; dest < nproc; dest++){
                    MPI_Send(&offseta, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
                    MPI_Send(&offsetb, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
                    MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
                    MPI_Send(&a[offseta][0], rows*SIZE, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
                    MPI_Send(&b[offsetb][0], rows*SIZE, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
                    offseta += rows;
                    offsetb = (offsetb+rows)%SIZE;
                }

                offseta = rows;
                offsetb = rows*l;
                //printf("Rank: %d, offseta: %d, offsetb: %d\n", myrank, offseta, offsetb);
                //printf("Offseta: %d\n", offseta);
                //printf("Offsetb: %d\n", offsetb);
                for(i = 0; i < offseta; i++) {
                    for(j = offsetb; j < offsetb+rows; j++) {
                        for(k = 0; k < SIZE; k++){
                            c[i][j] = c[i][j] + a[i][k]*b[j][k];
                        }
                    }
                }
                mtype = FROM_WORKER;
                for(src = 1; src < nproc; src++){
                    MPI_Recv(&offseta, 1, MPI_INT, src, mtype, MPI_COMM_WORLD, &status);
                    MPI_Recv(&offsetb, 1, MPI_INT, src, mtype, MPI_COMM_WORLD, &status);
                    MPI_Recv(&rows, 1, MPI_INT, src, mtype, MPI_COMM_WORLD, &status);
                    for(i = 0; i < rows; i++) {
                        MPI_Recv(&c[offseta+i][offsetb], offseta, MPI_DOUBLE, src, mtype, MPI_COMM_WORLD, &status); /*returns answer c(1,1)*/
                    }
                }
            }


            end_time = MPI_Wtime();
            if(DEBUG) /*Prints the resulting matrix c*/
                print_matrix();
            printf("Execution time on %2d nodes: %f\n", nproc, end_time-start_time);
        }
    }
    else{
        if(nproc > 1) {
            for(l = 0; l < nproc; l++){
                mtype = FROM_MASTER;
                MPI_Recv(&offseta, 1, MPI_INT, 0, mtype, MPI_COMM_WORLD, &status);
                MPI_Recv(&offsetb, 1, MPI_INT, 0, mtype, MPI_COMM_WORLD, &status);
                MPI_Recv(&rows, 1, MPI_INT, 0, mtype, MPI_COMM_WORLD, &status);
                MPI_Recv(&a[offseta][0], rows*SIZE, MPI_DOUBLE, 0, mtype, MPI_COMM_WORLD, &status);
                MPI_Recv(&b[offsetb][0], rows*SIZE, MPI_DOUBLE, 0, mtype, MPI_COMM_WORLD, &status);

                for(i = offseta; i < offseta+rows; i++) {
                    for(j = offsetb; j < offsetb+rows; j++) {
                        for(k = 0; k < SIZE; k++){
                            c[i][j] = c[i][j] + a[i][k]*b[j][k];
                        }
                    }
                }

                mtype = FROM_WORKER;
                MPI_Send(&offseta, 1, MPI_INT, 0, mtype, MPI_COMM_WORLD);
                MPI_Send(&offsetb, 1, MPI_INT, 0, mtype, MPI_COMM_WORLD);
                MPI_Send(&rows, 1, MPI_INT, 0, mtype, MPI_COMM_WORLD);
                for(i = 0; i < rows; i++){
                    MPI_Send(&c[offseta+i][offsetb], offseta, MPI_DOUBLE, 0, mtype, MPI_COMM_WORLD);
                }
            }
        }
    }
    MPI_Finalize();
    return 0;
}

Any advice would be helpful. Thank you in advance.

Ahmed Masud: Jens, you should add your code in here as well rather than putting it up on pastebin (I'll do it for you right now) so you don't get close votes.
Jens Lomander: I didn't know if people would like the whole code being up here, or if they'd complain that it would flood the page. But thank you a lot either way.
Ahmed Masud: Hi Jens, okay, so I looked at your code and mpirun definitely produces a problem. I am going to redo a matrix multiplication solution with comments rather than debug yours (it's the lower-left corner that you are not collating).
Jens Lomander: I have managed to solve the problem, but you are more than welcome to give it a go, if nothing else than to share an alternate solution ^^

1 Answer


This is not a definitive answer, but it should certainly help you with debugging.

I ran a test by adding the following code right after the point where the master receives the final data from the workers. Out of a bunch of outputs, I show only the important ones. Note that j + count never exceeds SIZE except in the case where the number of processors is 8. This is important because it means you are writing to memory you have not allocated.

for(i = 0; i < rows; i++) {
    MPI_Recv(&c[offseta+i][offsetb], offseta, MPI_DOUBLE, src, mtype, MPI_COMM_WORLD, &status);
    // I added the following for debugging.            
    if (src == nproc-1)
    {
        printf("src = %i\n", src);
        printf("i = %i\n", offseta+i);
        printf("j = %i\n", offsetb);
        printf("count = %i\n", offseta);
    }
}

np = 2

src = 1
i = 15
j = 8
count = 8

np = 4

src = 3
i = 15
j = 4
count = 12

np = 8

src = 7
i = 15
j = 10
count = 14
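
For np = 8, offsetb + count = 10 + 14 = 24, which runs past the end of a 16-element row, so that MPI_Recv writes beyond the bounds of c. As a debugging aid, you could guard each receive with a bounds check and abort as soon as an incoming block would not fit in the row. This is just a minimal sketch that reuses your existing variable names and keeps offseta as the count, exactly as your current code does:

for(i = 0; i < rows; i++) {
    /* The block received here fills 'offseta' doubles starting at
       column 'offsetb' of row 'offseta+i', so it must fit inside the row. */
    if (offsetb + offseta > SIZE) {
        fprintf(stderr, "out-of-bounds receive: row %d, col %d, count %d\n",
                offseta+i, offsetb, offseta);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Recv(&c[offseta+i][offsetb], offseta, MPI_DOUBLE, src, mtype, MPI_COMM_WORLD, &status);
}

Running with -n 8 should then make the faulty receive abort immediately instead of silently overwriting memory, which tells you exactly which offsets and counts are wrong.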