The code I am working on uses MPI to split a large 3-dimensional array (a cube) into sub-domains along all three axes, forming smaller cubes. I had previously worked on a simpler 2-dimensional equivalent with no issues.
Now, most MPI implementations buffer small messages internally (the so-called eager protocol), which makes MPI_SEND and MPI_RECV behave as though they were non-blocking for small chunks of data. This annoying habit (or gratifying habit, depending on how you see it) made the migration from 2D to 3D a struggle: all the calls that worked perfectly in 2D started deadlocking at the slightest provocation in 3D, since the data passed between processes were now 3D arrays and therefore larger.
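To make that behaviour concrete, here is a minimal standalone sketch (entirely separate from my solver) of the head-to-head pattern that bites: both ranks call MPI_SEND first, which works while the library buffers the message eagerly, but hangs once the message is too large to buffer. Run it with exactly 2 processes, and only if you want to watch it hang.

```fortran
program head_to_head
    use mpi
    implicit none

    integer :: rank, other, n, ierr
    integer :: stVal(MPI_STATUS_SIZE)
    double precision, allocatable :: sbuf(:), rbuf(:)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    other = 1 - rank          ! assumes exactly 2 processes
    n = 1000000               ! ~8 MB: far beyond any eager limit
    allocate(sbuf(n), rbuf(n))
    sbuf = dble(rank)

    ! Both ranks send first. Fine while the library buffers eagerly;
    ! deadlocks when MPI_SEND blocks until a matching receive is posted.
    call MPI_SEND(sbuf, n, MPI_DOUBLE_PRECISION, other, 0, MPI_COMM_WORLD, ierr)
    call MPI_RECV(rbuf, n, MPI_DOUBLE_PRECISION, other, 0, MPI_COMM_WORLD, stVal, ierr)

    call MPI_FINALIZE(ierr)
end program head_to_head
```

With a small n (say, 10) the same program usually completes, which is exactly the trap: the code appears correct until the messages grow.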
After a week of fighting and pulling out much hair, I constructed a complicated set of MPI_SEND and MPI_RECV calls that passed data smoothly across the faces, edges and corners of every cube in the domain (with periodicity and non-periodicity set appropriately at the different boundaries). The happiness was not to last: after I added a new boundary condition that required an extra path of communication between cells on one side of the domain, the code plunged into another vicious bout of deadlocking.
Having had enough, I decided to resort to non-blocking calls. With this much background, I hope my intentions with the code below are plain. I am not including the code I used to transfer data across the edges and corners of the sub-domains; if I can sort out the communication between the faces of the cubes, everything else should fall neatly into place.
The code uses five arrays to simplify data transfer:
rArr = Array of ranks of neighbouring cells
tsArr = Array of tags for sending data to each neighbour
trArr = Array of tags for receiving data from each neighbour
lsArr = Array of limits (indices) to describe the chunk of data to be sent
lrArr = Array of limits (indices) to describe the chunk of data to be received
Since each cube has 6 neighbours, one sharing each face, rArr, tsArr and trArr are each integer arrays of length 6. The limits arrays, on the other hand, are two-dimensional, as described below:
lsArr = [[xStart, xEnd, yStart, yEnd, zStart, zEnd, dSize],   ! for face 1 sending
         [xStart, xEnd, yStart, yEnd, zStart, zEnd, dSize],   ! for face 2 sending
          .
          .
         [xStart, xEnd, yStart, yEnd, zStart, zEnd, dSize]]   ! for face 6 sending
So a call to send values of the variable dCube across the i-th face of a cell (process) looks like this (with the face index i in the second dimension, matching the code below):

call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)

And another process, whose rank and tag match, receives the same chunk like this:

call MPI_RECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)
The lsArr and lrArr of the source and destination processes were tested and shown to have matching sizes (but different limits); the tag arrays were also checked and match.
My earlier version of the code with blocking calls worked perfectly, so I am 99% confident about the correctness of the values in the above arrays. If there is reason to doubt their accuracy, I can add those details, but then the post will become extremely long.
Below is the blocking version of my code, which worked perfectly. I apologize if it is a bit intractable; if further elucidation is necessary to identify the problem, I shall provide it.
subroutine exchangeData(dCube)
    use someModule
    implicit none

    integer :: i, j
    double precision, intent(inout), dimension(xS:xE, yS:yE, zS:zE) :: dCube

    do j = 1, 3
        if (mod(edArr(j), 2) == 0) then    !! edArr = [xRank, yRank, zRank]
            i = 2*j - 1
            call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                          lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)

            i = 2*j
            call MPI_RECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
                          lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)
        else
            i = 2*j
            call MPI_RECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
                          lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)

            i = 2*j - 1
            call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                          lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
        end if

        if (mod(edArr(j), 2) == 0) then
            i = 2*j
            call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                          lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)

            i = 2*j - 1
            call MPI_RECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
                          lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)
        else
            i = 2*j - 1
            call MPI_RECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
                          lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)

            i = 2*j
            call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                          lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
        end if
    end do
end subroutine exchangeData
Basically the code walks along each direction x, y and z, and first sends data from the odd-numbered faces, then from the even-numbered faces. I don't know if there is an easier way to do this; this arrangement was arrived at after innumerable deadlocking attempts that nearly drove me mad. The code to send data across the edges and corners is even longer.
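As an aside, I did wonder whether MPI_SENDRECV could collapse the odd/even staging, since it pairs each send with a receive in a single call and lets the library worry about the ordering. The sketch below reuses the same module variables (the limit arrays, rArr, the tag arrays, stVal, ierr) and is untested; I include it only to show the shape of the idea.

```fortran
subroutine exchangeDataSendrecv(dCube)
    use someModule
    implicit none

    integer :: i
    double precision, intent(inout), dimension(xS:xE, yS:yE, zS:zE) :: dCube

    ! One paired send/receive per face; MPI_SENDRECV cannot deadlock
    ! against another MPI_SENDRECV, so no odd/even staging is needed.
    do i = 1, 6
        call MPI_SENDRECV(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                          lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), &
                          dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
                          lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), &
                          MPI_COMM_WORLD, stVal, ierr)
    end do
end subroutine exchangeDataSendrecv
```

Since both calls here are blocking, the array sections are handled by ordinary copy-in/copy-out, the same way they are in my working blocking version.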
Now comes the actual problem. I replaced the above code with the following (a bit naively, maybe?):
subroutine exchangeData(dCube)
    use someModule
    implicit none

    integer :: i
    integer, dimension(6) :: fRqLst
    integer :: stLst(MPI_STATUS_SIZE, 6)
    double precision, intent(inout), dimension(xS:xE, yS:yE, zS:zE) :: dCube

    fRqLst = MPI_REQUEST_NULL
    do i = 1, 6
        call MPI_IRECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
                       lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, fRqLst(i), ierr)
    end do

    do i = 1, 6
        call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                      lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
    end do

    call MPI_WAITALL(6, fRqLst, stLst, ierr)
    call MPI_BARRIER(MPI_COMM_WORLD, ierr)
end subroutine exchangeData
someModule is a placeholder module that contains all the variables (they are actually distributed across a series of modules, but I'll gloss over that for now). The main idea was to use non-blocking MPI_IRECV calls to prime every process to receive data, and then send the data with a series of blocking MPI_SEND calls. However, I suspect that if things were this easy, parallel programming would be a piece of cake.
This code raises SIGABRT and exits with a double-free error. Moreover, it seems to be a Heisenbug that disappears at times.
Error message:
*** Error in `./a.out': double free or corruption (!prev): 0x00000000010315c0 ***
*** Error in `./a.out': double free or corruption (!prev): 0x00000000023075c0 ***
*** Error in `./a.out': double free or corruption (!prev): 0x0000000001d465c0 ***
Program received signal SIGABRT: Process abort signal.
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
Backtrace for this error:
#0 0x7F5807D3C7D7
#1 0x7F5807D3CDDE
#2 0x7F580768ED3F
#3 0x7F580768ECC9
#4 0x7F58076920D7
#5 0x7F58076CB393
#6 0x7F58076D766D
#0 0x7F4D387D27D7
#1 0x7F4D387D2DDE
#2 0x7F4D38124D3F
#3 0x7F4D38124CC9
#4 0x7F4D381280D7
#5 0x7F4D38161393
#0 #6 0x7F4D3816D66D
0x7F265643B7D7
#1 0x7F265643BDDE
#2 0x7F2655D8DD3F
#3 0x7F2655D8DCC9
#4 0x7F2655D910D7
#5 0x7F2655DCA393
#6 0x7F2655DD666D
#7 0x42F659 in exchangedata_ at solver.f90:1542 (discriminator 1)
#7 0x42F659 in exchangedata_ at solver.f90:1542 (discriminator 1)
#8 0x42EFFB in processgrid_ at solver.f90:431
#9 0x436CF0 in MAIN__ at solver.f90:97
#8 0x42EFFB in processgrid_ at solver.f90:431
#9 0x436CF0 in MAIN__ at solver.f90:97
#0 0x7FC9DA96B7D7
#1 0x7FC9DA96BDDE
#2 0x7FC9DA2BDD3F
#3 0x7FC9DA2BDCC9
#4 0x7FC9DA2C10D7
#5 0x7FC9DA2FA393
#6 0x7FC9DA30666D
#7 0x42F659 in exchangedata_ at solver.f90:1542 (discriminator 1)
#8 0x42EFFB in processgrid_ at solver.f90:431
#9 0x436CF0 in MAIN__ at solver.f90:97
#7 0x42F659 in exchangedata_ at solver.f90:1542 (discriminator 1)
#8 0x42EFFB in processgrid_ at solver.f90:431
#9 0x436CF0 in MAIN__ at solver.f90:97
I tried searching this site for similar errors using the '(discriminator 1)' part, but couldn't find any. I also searched for cases where MPI produces a double-free memory-corruption error, again to no avail.
I must also point out that line 1542 in the error message corresponds to the blocking MPI_SEND call in my code.
The above error popped up when I was using gfortran 4.8.2 with OpenMPI 1.6.5. However, I also tried running the above code with the Intel Fortran compiler and received a curious error message:
[21] trying to free memory block that is currently involved to uncompleted data transfer operation
I searched for the above error on the net and got almost nothing, so that was a dead end as well. :( The full error message is a bit too long, but below is part of it:
*** glibc detected *** ./a.out: munmap_chunk(): invalid pointer: 0x0000000001c400a0 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x0000000001c40410 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x0000000000a67790 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x0000000000a67790 ***
*** glibc detected *** ./a.out: free(): invalid next size (normal): 0x0000000000d28c80 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x00000000015354b0 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x00000000015354b0 ***
*** glibc detected *** ./a.out: free(): invalid next size (normal): 0x0000000000f51520 ***
[20] trying to free memory block that is currently involved to uncompleted data transfer operation
free mem - addr=0x26bd800 len=3966637480
RTC entry - addr=0x26a9e70 len=148800 cnt=1
Assertion failed in file ../../i_rtc_cache.c at line 1397: 0
internal ABORT - process 20
[21] trying to free memory block that is currently involved to uncompleted data transfer operation
free mem - addr=0x951e90 len=2282431520
RTC entry - addr=0x93e160 len=148752 cnt=1
Assertion failed in file ../../i_rtc_cache.c at line 1397: 0
internal ABORT - process 21
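One rearrangement I can think of, in case it helps narrow things down, is to keep the non-blocking receives away from array sections entirely: receive each face into a dedicated contiguous buffer that stays alive until MPI_WAITALL, and copy into dCube only afterwards. This is an untested sketch; rBuf and the reshape bookkeeping are hypothetical names of mine, not something from my solver.

```fortran
subroutine exchangeDataBuffered(dCube)
    use someModule
    implicit none

    integer :: i
    integer, dimension(6) :: fRqLst
    integer :: stLst(MPI_STATUS_SIZE, 6)
    double precision, intent(inout), dimension(xS:xE, yS:yE, zS:zE) :: dCube
    double precision, allocatable :: rBuf(:, :)   ! hypothetical: one column per face

    allocate(rBuf(maxval(lrArr(7, :)), 6))
    fRqLst = MPI_REQUEST_NULL

    ! Receive into contiguous columns of rBuf, which persist until MPI_WAITALL.
    do i = 1, 6
        call MPI_IRECV(rBuf(1:lrArr(7, i), i), lrArr(7, i), MPI_DOUBLE_PRECISION, &
                       rArr(i), trArr(i), MPI_COMM_WORLD, fRqLst(i), ierr)
    end do

    do i = 1, 6
        call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                      lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
    end do

    call MPI_WAITALL(6, fRqLst, stLst, ierr)

    ! Only now unpack the buffers into the ghost layers of dCube.
    do i = 1, 6
        dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)) = &
            reshape(rBuf(1:lrArr(7, i), i), &
                    [lrArr(2, i) - lrArr(1, i) + 1, &
                     lrArr(4, i) - lrArr(3, i) + 1, &
                     lrArr(6, i) - lrArr(5, i) + 1])
    end do
    deallocate(rBuf)
end subroutine exchangeDataBuffered
```

The column-major reshape should match the memory order in which the blocking sends pack the face sections, but I have not verified this variant.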
If my error is a careless one or born of insufficient knowledge, the above details might suffice; if it is a deeper issue, then I'll gladly elaborate further.
Thanks in advance!