0 votes

I'm implementing a parallel version of the Game of Life using MPI and getting a segmentation fault (signal 11). I'm new to MPI and couldn't get valgrind to tell me exactly where the error is. I simplified my code and found that the problem is in the marked snippet below.

Edit: marked the block of code where the problem exists

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>


int main(int argc, char *argv[]){

  if(argc!=5)
    printf("Incorrect number of arguments.\n");
  else{
  // program logic here
  int m, n, sum, pid, nprocs;
  char outfilename[16];
  FILE *outfile;
  FILE *infile;
  int r=atoi(argv[3]);
  int c=atoi(argv[4]);
  int gens=atoi(argv[2]);
  int **old, **new, *new1d, *old1d;
  int i,j;
  MPI_Status status;


  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD,&pid); //initializing MPI here

//prevented segmentation error by using atoi

//domain decomposition start

// problem arises here
int seg=c/nprocs; //divide by width
int ghost=seg+1;
int row=r+1;
int tsize=ghost*row; //ghost cells

  old1d = malloc(tsize*sizeof(int));
  new1d = malloc(tsize*sizeof(int));
  old   = malloc(r*sizeof(int*));
  new   = malloc(r*sizeof(int*));

  for(i=0; i<ghost; i++){
    old[i] = &old1d[i*row];
    new[i] = &new1d[i*row];
  }
// problem ends 

if(pid==0){
        MPI_Send(&old[0][seg], c, MPI_INT, 1,  0, MPI_COMM_WORLD);
        MPI_Recv(&old[0][ghost],c, MPI_INT, 1,  1, MPI_COMM_WORLD, &status);
        MPI_Send(&old[0][1],   c, MPI_INT, 1,  2, MPI_COMM_WORLD);
        MPI_Recv(&old[0][0],   c, MPI_INT, 1,  3, MPI_COMM_WORLD, &status);
}
else{
        MPI_Recv(&old[0][0],    c, MPI_INT, 0,  0, MPI_COMM_WORLD, &status);
        MPI_Send(&old[0][1],    c, MPI_INT, 0,  1, MPI_COMM_WORLD);
        MPI_Recv(&old[0][ghost],c, MPI_INT, 0,  2, MPI_COMM_WORLD, &status);
        MPI_Send(&old[0][seg], c, MPI_INT, 0,  3, MPI_COMM_WORLD);

}
infile=fopen(argv[1],"r");

if(infile==NULL){
    printf("Could not locate file.\n");
    exit(1);
}

  while(fscanf(infile,"%d %d",&m, &n)!=EOF){
    old[m][n]=1;
  }
  fclose(infile);
//repeat for number of generations
  for(n=0; n<gens; n++){

    for(i=1; i<=r; i++){
      for(j=1; j<=c; j++){

        sum =  old[i-1][j-1] + old[i-1][j] + old[i-1][j+1]
          + old[i][j-1] + old[i][j+1]
          + old[i+1][j-1] + old[i+1][j] + old[i+1][j+1];

        if(sum==2 || sum==3)
            new[i][j]=1;
        else
            new[i][j]=0;
      }
    }
   //copying old state into new state
    for(i=1; i<=r; i++){
      for(j=1; j<=c; j++){
        old[i][j] = new[i][j];
      }
    }
  }

  //create new output file
  sprintf(outfilename,"output_%d",pid);
  outfile=fopen(outfilename,"w");
  for(i=1; i<=r; i++){
    for(j=1; j<=c; j++){
     if(new[i][j]==1){
     fprintf(outfile,"%d\t%d\n",i ,j);
     printf("%d %d",i,j);
     }
    }
  }

  fclose(outfile);
  MPI_Finalize();
  }
  return 0;
}

Edit: The input file life.data.1 has X Y co-ordinates that denote live cells.
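Judging from the fscanf("%d %d", &m, &n) and old[m][n]=1 in the code, each line of the file holds two integers, the first used as the row index and the second as the column index of a live cell, e.g. a file beginning

1 2
2 3
3 1

would mark those three cells as live.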

What did you do to simplify the application in order to get an idea of where the segmentation fault occurs? First simplify, then add more code to detect the exact fault location. - Jeroen Peeters
I implemented a parallel version of this without doing domain decomposition. So basically that's the one without MPI_send / MPI_receive and the part where I'm dividing the domain vertically. - Aditya
Still, add more details of what you've tried and try to be more specific in your question. This is not a platform where you can dump code and have people solve your issue. - Jeroen Peeters
Update your question to indicate where the problem materializes. Please identify the block of code, or line even, where it occurs. Maybe someone will help. - Jeroen Peeters
What is argv[1] supposed to mean? I don't see it referenced anywhere. Hint: C uses zero offset. (for(i=1; i<=r; i++){ assumes one-offset, too, IMHO.) - wildplasser

1 Answer

1 vote

As wildplasser hinted at in his comment, the crash stems from this loop (which is past the point where you indicated the problem was):

for(i=1; i<=r; i++){
  for(j=1; j<=c; j++){

    sum =  old[i-1][j-1] + old[i-1][j] + old[i-1][j+1]
      + old[i][j-1] + old[i][j+1] //This line causes a segfault
      + old[i+1][j-1] + old[i+1][j] + old[i+1][j+1]; //This line causes a segfault

    if(sum==2 || sum==3)
        new[i][j]=1; //This line causes a segfault
    else
        new[i][j]=0; //This line causes a segfault
  }
}

All of the lines indicated by a comment can segfault because of the indices being accessed. Your loops give i values 1 through r and j values 1 through c, inclusive. However, it is only valid to access elements 0 through r-1 of the old array and 0 through r-1 of the new array, since both were allocated with r row pointers. To fix this, we can change the outer loop to

for(i = 1; i < r-1; i++){
    //...code here...
}

Notably, this will not set every value in new. It's possible you meant to declare old to be size row, which is r+1. In that case, you could have the continuation condition be i < r in this loop. I'm not sure that's the case, though. This handles the errors arising from accessing elements in old.

However, there are still issues arising from the use of old and old1d. old has r (250) row pointers, but only the first ghost (63) of them are initialized in this case. old1d never has any of its non-live values initialized, and those uninitialized values are what MPI_Send and MPI_Recv transfer. You need to pick how large the old array is and make sure every value in it is initialized, and do the same for old1d. The same goes for new and new1d.
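One way to do that, as a minimal sketch: allocate both grids with a single agreed-upon size and zero-initialize them so every cell not listed in the input file starts out dead. I'm assuming a 1-based interior of r rows and c columns with a one-cell border on every side (which is what the i = 1..r, j = 1..c loops appear to expect); the names rows and cols are mine, not from your code:

int rows = r + 2;   /* interior rows 1..r plus border rows 0 and r+1 */
int cols = c + 2;   /* interior cols 1..c plus border cols 0 and c+1 */

old1d = calloc((size_t)rows * cols, sizeof(int));   /* calloc: every cell starts dead */
new1d = calloc((size_t)rows * cols, sizeof(int));
old   = malloc(rows * sizeof(int *));
new   = malloc(rows * sizeof(int *));
for (i = 0; i < rows; i++) {
    old[i] = &old1d[i * cols];   /* each row is cols ints long, not 'row' ints */
    new[i] = &new1d[i * cols];
}
/* old[i][j] and new[i][j] are now valid for 0 <= i <= r+1 and 0 <= j <= c+1,
   so loops over i = 1..r, j = 1..c stay in bounds */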

By changing the loops so that old and new are only ever indexed between 0 and ghost-1 inclusive, and so that each row (something like old[i] or new[i]) is only ever indexed between 0 and r-1 inclusive, the program no longer crashes. That means keeping i < ghost (or i < ghost-1 in the main loop) and j < r (or, similarly, j < r-1) in each loop. However, this almost certainly does not give the behavior you want. Your current loops look like they were written for a serial program and ignore the parallelism you've tried to introduce by splitting the columns into seg-sized pieces. The loops need to be redesigned so that each process only updates its own portion of the grid, and so that processes exchange the boundary cells that their neighbors need. This requires a significant overhaul of the program.
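To make the overall shape concrete, here is a rough sketch of what one generation could look like; it is not a drop-in replacement for your code. I've assumed a row-wise decomposition (rather than the column split you started from) because whole rows are contiguous in memory, which keeps the halo exchange to plain MPI_Sendrecv calls. local_r stands for however many rows this rank owns, and old/new are (local_r+2) x (c+2), laid out as in the sketch above:

int up   = (pid == 0)          ? MPI_PROC_NULL : pid - 1;   /* neighbour above, or nobody */
int down = (pid == nprocs - 1) ? MPI_PROC_NULL : pid + 1;   /* neighbour below, or nobody */

for (n = 0; n < gens; n++) {
    /* exchange ghost rows: my first real row goes up, my bottom ghost row comes from below */
    MPI_Sendrecv(&old[1][0],         c + 2, MPI_INT, up,   0,
                 &old[local_r+1][0], c + 2, MPI_INT, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* my last real row goes down, my top ghost row comes from above */
    MPI_Sendrecv(&old[local_r][0],   c + 2, MPI_INT, down, 1,
                 &old[0][0],         c + 2, MPI_INT, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* update only the rows this rank owns */
    for (i = 1; i <= local_r; i++) {
        for (j = 1; j <= c; j++) {
            sum = old[i-1][j-1] + old[i-1][j] + old[i-1][j+1]
                + old[i][j-1]                 + old[i][j+1]
                + old[i+1][j-1] + old[i+1][j] + old[i+1][j+1];
            /* note: the standard rules also depend on the cell's own state:
               a live cell survives with 2 or 3 neighbours, a dead cell is born with exactly 3 */
            new[i][j] = (sum == 3 || (sum == 2 && old[i][j])) ? 1 : 0;
        }
    }

    /* swap the grids instead of copying element by element */
    int **tmp = old; old = new; new = tmp;
}

Sends to and receives from MPI_PROC_NULL complete immediately without transferring anything, so the two end ranks need no special-case code.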

There are two more issues that I want to point out:

  1. The program hangs if you run it on more than 2 processes, because the sends and receives are hard-coded to communicate only between process 0 and process 1. You need to make sure that processes 2 and above either don't try to send/receive, or actually have a matching partner when they do (see the sketch after this list).

  2. When you compute seg = c/nprocs, the remainder of the division is silently discarded. There are 250 columns, but seg becomes 250/4 = 62 (by integer division), and 62*4 = 248, which is less than the full number of columns. You need to handle the case where the columns do not divide evenly among the processes (see the sketch after this list).
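Here is a minimal sketch of both points; the names base, extra, local_c, first_col, left and right are my own, not anything already in your code:

/* Point 2: split the c columns over nprocs ranks so no column is dropped.
   The first (c % nprocs) ranks each take one extra column.  The same
   arithmetic works for rows if you switch to a row decomposition. */
int base      = c / nprocs;                    /* minimum columns per rank     */
int extra     = c % nprocs;                    /* leftover columns to hand out */
int local_c   = base + (pid < extra ? 1 : 0);  /* columns owned by this rank   */
int first_col = pid * base + (pid < extra ? pid : extra);  /* my first global column */

/* Point 1: with the neighbours at either end set to MPI_PROC_NULL, the same
   send/receive code is valid for any number of processes, because
   communication with MPI_PROC_NULL completes immediately and does nothing. */
int left  = (pid == 0)          ? MPI_PROC_NULL : pid - 1;
int right = (pid == nprocs - 1) ? MPI_PROC_NULL : pid + 1;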