
For some background, I'm working on parallelizing a basic PDE solver with MPI. The program takes a domain and assigns each processor a grid covering a portion of it. If I run with a single core or four cores, the program runs just fine. However, if I run with two or three cores, I get a core dump like the following:

*** Error in `MeshTest': corrupted size vs. prev_size: 0x00000000018bd540 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fc1a63f77e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x80dfb)[0x7fc1a6400dfb]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fc1a640453c]
/usr/lib/libmpi.so.12(+0x25919)[0x7fc1a6d25919]
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x52a9)[0x7fc198fe52a9]
/usr/lib/libmpi.so.12(ompi_mpi_finalize+0x412)[0x7fc1a6d41a22]
MeshTest(_ZN15MPICommunicator7cleanupEv+0x26)[0x422e70]
MeshTest(main+0x364)[0x41af2a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fc1a63a0830]
MeshTest(_start+0x29)[0x41aaf9]

*** Error in `MeshTest': corrupted size vs. prev_size: 0x00000000022126e0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fca753f77e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7e9dc)[0x7fca753fe9dc]
/lib/x86_64-linux-gnu/libc.so.6(+0x80678)[0x7fca75400678]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fca7540453c]
/usr/lib/libmpi.so.12(+0x25919)[0x7fca75d25919]
/usr/lib/openmpi/lib/openmpi/mca_btl_tcp.so(+0x4381)[0x7fca68844381]
/usr/lib/libopen-pal.so.13(mca_base_component_close+0x19)[0x7fca74c88fe9]
/usr/lib/libopen-pal.so.13(mca_base_components_close+0x42)[0x7fca74c89062]
/usr/lib/libmpi.so.12(+0x7d3b4)[0x7fca75d7d3b4]

======= Memory map: ========
<insert core dump>

I've been able to trace the errors to when I create a new grid:

Result Domain::buildGrid(unsigned int shp[2], pair2<double> &bounds){
  // ... Unrelated code ...

  // grid is already allocated and needs to be cleared.
  delete grid;
  grid = new Grid(bounds, shp, nghosts);
  return SUCCESS;
}

Grid::Grid(const pair2<double>& bounds, unsigned int sz[2], unsigned int nghosts){
  // ... Code unrelated to memory allocation ...

  // Construct the grid. Start by adding ghost points.
  shp[0] = sz[0] + 2*nghosts;
  shp[1] = sz[1] + 2*nghosts;
  try{
    points[0] = new double[shp[0]];
    points[1] = new double[shp[1]];
    for(int i = 0; i < shp[0]; i++){
      points[0][i] = grid_bounds[0][0] + (i - (int)nghosts)*dx;
    }
    for(int j = 0; j < shp[1]; j++){
      points[1][j] = grid_bounds[1][0] + (j - (int)nghosts)*dx;
    }
  }
  catch(std::bad_alloc& ba){
    std::cout << "Failed to allocate memory for grid.\n";
    shp[0] = 0;
    shp[1] = 0;
    dx = 0;
    points[0] = NULL;
    points[1] = NULL;
  }
}

Grid::~Grid(){
  delete[] points[0];
  delete[] points[1];
}

As far as I know, my use of MPI is fine, and all the MPI-dependent functionality in the Domain class seems to work correctly. I'm assuming that something, somewhere, is accessing memory outside its bounds, but I have no idea where: at this point the code literally just initializes MPI, loads some parameters, sets up the grid (with the only memory access occurring during its construction), then calls MPI_Finalize() and returns.

You can use a tool such as Valgrind to track out-of-bounds accesses (be aware that some of the warnings issued by the MPI library can be ignored). – Gilles Gouaillardet
How do you decompose the grid among MPI ranks? It is not clear from the code. – Mahmoud Fayez
Your memory allocation and initialisation of points[][] look fine to me. The error is likely somewhere in the code that is not shown. – Hristo Iliev
@MahmoudFayez: there's a function that uses MPI_Dims_create() to figure out how to arrange the processors. It uses the rank to determine exactly where each processor should place its grid, then calculates the bounds and number of grid points for that grid and feeds both into Domain::buildGrid(). @GillesGouaillardet: thanks, I'll look into the errors from Valgrind and edit my post accordingly. – Jacob Fields
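
For reference, a typical way to run the program under Valgrind, as suggested in the comments, is something like the following (the flags are just one reasonable choice, and the code should be compiled with -g so Valgrind can point at source lines):

mpirun -np 2 valgrind --leak-check=full --track-origins=yes ./MeshTest

An out-of-bounds write into a heap array normally shows up as an "Invalid write" report with a stack trace into the offending function, well before glibc aborts at a later free(). As noted in the first comment, Open MPI itself triggers a number of warnings that can be ignored; many installations also ship a suppression file (commonly named openmpi-valgrind.supp) that can be passed via --suppressions= to quiet them.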

1 Answer


It turns out there was an error in my Grid constructor while assigning points: the loop that assigns the y points read points[0][j] = ... instead of points[1][j] = .... I somehow caught and corrected it while copying the code into my post, but not in my actual code. The error only showed up in 2- and 3-core runs because the grid was perfectly square in the 1- and 4-core runs, so shp[0] was equal to shp[1]. Thanks, everyone, for the tips. I feel kind of embarrassed now after seeing it was something so simple.
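
For completeness, this is what the offending loop looked like in the actual source, reconstructed from the description above (the constructor posted in the question already shows the corrected version):

// Buggy: the y loop wrote into points[0], which only has room for shp[0]
// entries, so it ran off the end of the allocation whenever shp[1] > shp[0].
for(int j = 0; j < shp[1]; j++){
  points[0][j] = grid_bounds[1][0] + (j - (int)nghosts)*dx;
}

// Fixed: the y coordinates belong in points[1].
for(int j = 0; j < shp[1]; j++){
  points[1][j] = grid_bounds[1][0] + (j - (int)nghosts)*dx;
}

That also explains the symptom pattern: with 1 or 4 ranks the local grids were square (shp[0] == shp[1]), so the stray writes stayed inside the allocation, while with 2 or 3 ranks the domain is presumably split into a 2x1 or 3x1 layout, the local grids are rectangular, and the overflow corrupts the heap, which glibc only detects at a later free() inside MPI_Finalize() (hence the backtrace through ompi_mpi_finalize).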