For some background, I'm working on parallelizing a basic PDE solver with MPI. The program takes a domain and assigns each processor a grid covering a portion of it. If I run with a single core or with four cores, the program runs just fine. However, if I run with two or three cores, I get a core dump like the following (the output from the two crashing ranks is interleaved):
*** Error in `MeshTest': corrupted size vs. prev_size: 0x00000000018bd540 ***
======= Backtrace: =========
*** Error in `MeshTest': corrupted size vs. prev_size: 0x00000000022126e0 ***
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fc1a63f77e5]
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x80dfb)[0x7fc1a6400dfb]
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fca753f77e5]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fc1a640453c]
/lib/x86_64-linux-gnu/libc.so.6(+0x7e9dc)[0x7fca753fe9dc]
/usr/lib/libmpi.so.12(+0x25919)[0x7fc1a6d25919]
/lib/x86_64-linux-gnu/libc.so.6(+0x80678)[0x7fca75400678]
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x52a9)[0x7fc198fe52a9]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fca7540453c]
/usr/lib/libmpi.so.12(ompi_mpi_finalize+0x412)[0x7fc1a6d41a22]
/usr/lib/libmpi.so.12(+0x25919)[0x7fca75d25919]
MeshTest(_ZN15MPICommunicator7cleanupEv+0x26)[0x422e70]
/usr/lib/openmpi/lib/openmpi/mca_btl_tcp.so(+0x4381)[0x7fca68844381]
MeshTest(main+0x364)[0x41af2a]
/usr/lib/libopen-pal.so.13(mca_base_component_close+0x19)[0x7fca74c88fe9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fc1a63a0830]
/usr/lib/libopen-pal.so.13(mca_base_components_close+0x42)[0x7fca74c89062]
MeshTest(_start+0x29)[0x41aaf9]
/usr/lib/libmpi.so.12(+0x7d3b4)[0x7fca75d7d3b4]
======= Memory map: ========
<insert core dump>
I've been able to trace the error back to the point where I create a new grid:
Result Domain::buildGrid(unsigned int shp[2], pair2<double> &bounds){
  // ... Unrelated code ...

  // grid is already allocated and needs to be cleared.
  delete grid;
  grid = new Grid(bounds, shp, nghosts);

  return SUCCESS;
}
Grid::Grid(const pair2<double>& bounds, unsigned int sz[2], unsigned int nghosts){
  // ... Code unrelated to memory allocation ...

  // Construct the grid. Start by adding ghost points.
  shp[0] = sz[0] + 2*nghosts;
  shp[1] = sz[1] + 2*nghosts;

  try{
    points[0] = new double[shp[0]];
    points[1] = new double[shp[1]];
    for(int i = 0; i < shp[0]; i++){
      points[0][i] = grid_bounds[0][0] + (i - (int)nghosts)*dx;
    }
    for(int j = 0; j < shp[1]; j++){
      points[1][j] = grid_bounds[1][0] + (j - (int)nghosts)*dx;
    }
  }
  catch(std::bad_alloc& ba){
    std::cout << "Failed to allocate memory for grid.\n";
    shp[0] = 0;
    shp[1] = 0;
    dx = 0;
    points[0] = NULL;
    points[1] = NULL;
  }
}
Grid::~Grid(){
  delete[] points[0];
  delete[] points[1];
}
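Similarly, the members of Grid referenced above are, roughly (exact types trimmed/paraphrased):

class Grid {
  public:
    Grid(const pair2<double>& bounds, unsigned int sz[2], unsigned int nghosts);
    ~Grid();
  private:
    double *points[2];          // coordinate arrays in x and y
    unsigned int shp[2];        // points per direction, including ghosts
    double dx;                  // grid spacing
    pair2<double> grid_bounds;  // physical bounds of the local grid
};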
As far as I know, my MPI implementation is fine, and all of the MPI-dependent functionality in the Domain class seems to work correctly. My assumption is that something, somewhere, is accessing memory outside the range it owns, but I have no idea where: at this point the code literally just initializes MPI, loads some parameters, sets up the grid (with the only heap access occurring during its construction), calls MPI_Finalize(), and returns.
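To make that control flow concrete, main() is essentially the following. Identifier names other than those already shown are illustrative, and the parameter loading is elided:

int main(int argc, char *argv[]){
  // Initialize MPI.
  MPI_Init(&argc, &argv);

  // Load run parameters: physical bounds, per-rank grid shape, ghost count.
  // ... produces shp and bounds (elided) ...

  // Build the local grid; this is the only place heap memory is touched.
  Domain domain;
  domain.buildGrid(shp, bounds);

  // Shut down MPI; the corruption is reported from inside the finalize call.
  communicator.cleanup();  // wraps MPI_Finalize(), per the backtrace above
  return 0;
}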
Comments:

Use valgrind to track the out-of-bounds access (be aware that some of the warnings issued by the MPI library can be ignored). – Gilles Gouaillardet

The code that fills points[][] looks fine to me. The error is likely somewhere in the code that is not shown. – Hristo Iliev

The Domain class uses MPI_Dim() to figure out how to arrange the processors. It uses the rank to determine exactly where each processor should place its grid, then it calculates the bounds for that grid and the number of grid points and feeds both into Domain::buildGrid(). @GillesGouaillardet, thanks, I'll look into the errors from Valgrind and edit my post appropriately. – Jacob Fields
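Following that suggestion, the plan is to run each rank under valgrind with an invocation along these lines (log file name is just an example):

mpirun -np 2 valgrind --track-origins=yes --log-file=vg.%p.log ./MeshTest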