
The code I am working on uses MPI to split a large three-dimensional array (a cube) into subdomains along all three axes to form smaller cubes. I had previously worked on a simpler two-dimensional equivalent with no issues.

Now, MPI has this annoying habit (or gratifying one, depending on how you see it) of letting MPI_SEND return before a matching receive is posted whenever the message is small enough to be buffered internally (the eager protocol). This made the migration from 2D to 3D a real struggle: calls that worked perfectly in 2D started deadlocking at the slightest provocation in 3D, since the chunks passed between processes were now 3D arrays and therefore much larger.
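To make that concrete, here is a minimal two-process sketch of the pattern that bites here (this is not from my solver; the names are made up for illustration). It runs fine while the message fits in MPI's eager buffer and hangs once MPI_SEND has to wait for the matching receive:

program eagerDeadlockDemo
use mpi
implicit none
integer, parameter :: n = 100000     !!large enough to exceed typical eager limits
integer :: rank, other, ierr
integer :: st(MPI_STATUS_SIZE)
double precision :: sendBuf(n), recvBuf(n)

call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

sendBuf = dble(rank)
other = 1 - rank                     !!run with exactly 2 processes

!!Both ranks send first. For small n the sends are buffered eagerly and return,
!!so the receives are reached; for large n both MPI_SENDs wait for a matching
!!receive that is never posted, and the program deadlocks.
call MPI_SEND(sendBuf, n, MPI_DOUBLE_PRECISION, other, 0, MPI_COMM_WORLD, ierr)
call MPI_RECV(recvBuf, n, MPI_DOUBLE_PRECISION, other, 0, MPI_COMM_WORLD, st, ierr)

call MPI_FINALIZE(ierr)
end program eagerDeadlockDemo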

After a week of fighting and much hair-pulling, I constructed a complicated set of MPI_SEND and MPI_RECV calls that managed to pass data smoothly across the faces, edges and corners of every cube in the domain (with periodicity and non-periodicity set appropriately at the different boundaries). The happiness was not to last. After adding a new boundary condition that required an extra path of communication between cells on one side of the domain, the code plunged into another vicious bout of deadlocking.

Having had enough, I decided to resort to non-blocking calls. With this much background, I hope my intentions with the code below are quite plain. I am not including the code I use to transfer data across the edges and corners of the sub-domains: if I can sort out the communication between the faces of the cubes, everything else should fall neatly into place.

The code uses five arrays to simplify data transfer:

  1. rArr = Array of ranks of neighbouring cells

  2. tsArr = Array of tags for sending data to each neighbour

  3. trArr = Array of tags for receiving data from each neighbour

  4. lsArr = Array of limits (indices) to describe the chunk of data to be sent

  5. lrArr = Array of limits (indices) to describe the chunk of data to be received

Since each cube has 6 neighbours sharing a face each, rArr, tsArr and trArr are integer arrays of length 6. The limits arrays, on the other hand, are two-dimensional, with one column of 7 entries per face, as described below:

lsArr(:, 1) = [xStart, xEnd, yStart, yEnd, zStart, zEnd, dSize]    !for face 1 sending
lsArr(:, 2) = [xStart, xEnd, yStart, yEnd, zStart, zEnd, dSize]    !for face 2 sending
.
.
lsArr(:, 6) = [xStart, xEnd, yStart, yEnd, zStart, zEnd, dSize]    !for face 6 sending

So a call to send values of the variable dCube across the ith face of a cell (process) looks like this:

call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)

And the neighbouring process, with the matching rank and tag, receives the same chunk as below:

call MPI_RECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)

The lsArr of the sending process and the lrArr of the receiving process were tested and describe chunks of matching size (though with different limits). The tag arrays were also checked and match.

My earlier version of the code with blocking calls worked perfectly, so I am 99% confident about the correctness of the values in the above arrays. If there is reason to doubt their accuracy, I can add those details, but then the post will become extremely long.

Below is the blocking version of my code, which worked perfectly. I apologize if it is a bit impenetrable; if it needs further explanation to identify the problem, I shall provide it.

subroutine exchangeData(dCube)
use someModule

implicit none
integer :: i, j
double precision, intent(inout), dimension(xS:xE, yS:yE, zS:zE) :: dCube

do j = 1, 3    !!loop over the x, y and z directions
    if (mod(edArr(j), 2) == 0) then    !!edArr = [xRank, yRank, zRank]; even rank: send first, then receive
        i = 2*j - 1
        call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                      lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)

        i = 2*j
        call MPI_RECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
                      lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)
    else    !!odd rank along this direction: receive first, then send
        i = 2*j
        call MPI_RECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
                      lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)

        i = 2*j - 1
        call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                      lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
    end if

    if (mod(edArr(j), 2) == 0) then    !!second half: exchange the even-numbered face's data
        i = 2*j
        call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                      lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)

        i = 2*j - 1
        call MPI_RECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
                      lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)
    else
        i = 2*j - 1
        call MPI_RECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
                      lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)

        i = 2*j
        call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                      lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
    end if
end do
end subroutine exchangeData

Basically, it goes along each direction (x, y and z) and first sends the data of the odd-numbered face, then that of the even-numbered face; processes with even and odd rank-coordinates order their sends and receives oppositely, so that every blocking send meets an already-posted receive. I don't know if there is an easier way to do this. It was arrived at after innumerable deadlocking versions that nearly drove me mad. The code that sends data across edges and corners is even longer.
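For what it's worth, MPI_SENDRECV might express the same pairwise exchange more compactly, since it pairs each send with a receive internally and removes the ordering problem. The following is only a sketch using the same arrays and the same face pairing as the blocking code above; I have not tried it in the solver, and it relies on the same copy-in/copy-out of array sections as the blocking version:

subroutine exchangeDataSendrecv(dCube)
use someModule

implicit none
integer :: i, j, k
double precision, intent(inout), dimension(xS:xE, yS:yE, zS:zE) :: dCube

do j = 1, 3
    !!send across the odd-numbered face while receiving into the even-numbered face, then swap
    i = 2*j - 1
    k = 2*j
    call MPI_SENDRECV(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                      lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), &
                      dCube(lrArr(1, k):lrArr(2, k), lrArr(3, k):lrArr(4, k), lrArr(5, k):lrArr(6, k)), &
                      lrArr(7, k), MPI_DOUBLE_PRECISION, rArr(k), trArr(k), &
                      MPI_COMM_WORLD, stVal, ierr)

    i = 2*j
    k = 2*j - 1
    call MPI_SENDRECV(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                      lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), &
                      dCube(lrArr(1, k):lrArr(2, k), lrArr(3, k):lrArr(4, k), lrArr(5, k):lrArr(6, k)), &
                      lrArr(7, k), MPI_DOUBLE_PRECISION, rArr(k), trArr(k), &
                      MPI_COMM_WORLD, stVal, ierr)
end do
end subroutine exchangeDataSendrecv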

Now comes the actual problem. I replaced the above code with the following (a bit naively, maybe?):

subroutine exchangeData(dCube)
use someModule

implicit none
integer :: i, j
integer, dimension(6) :: fRqLst
integer :: stLst(MPI_STATUS_SIZE, 6)
double precision, intent(inout), dimension(xS:xE, yS:yE, zS:zE) :: dCube

fRqLst = MPI_REQUEST_NULL
do i = 1, 6
    call MPI_IRECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
                        lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, fRqLst(i), ierr)
end do

do i = 1, 6
    call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                       lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
end do
call MPI_WAITALL(6, fRqLst, stLst, ierr)
call MPI_BARRIER(MPI_COMM_WORLD, ierr)
end subroutine exchangeData

someModule is a placeholder for the module that contains all the variables (they are actually spread across a series of modules, but I'll gloss over that for now). The main idea was to use non-blocking MPI_IRECV calls to prime every process to receive data, and then to send the data with a series of blocking MPI_SEND calls. However, if things were really that easy, parallel programming would be a piece of cake.

This code gives a SIGABRT and exits with a double-free error. Moreover, it seems to be a Heisenbug that disappears at times.

Error message:

*** Error in `./a.out': double free or corruption (!prev): 0x00000000010315c0 ***
*** Error in `./a.out': double free or corruption (!prev): 0x00000000023075c0 ***
*** Error in `./a.out': double free or corruption (!prev): 0x0000000001d465c0 ***

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7F5807D3C7D7
#1  0x7F5807D3CDDE
#2  0x7F580768ED3F
#3  0x7F580768ECC9
#4  0x7F58076920D7
#5  0x7F58076CB393
#6  0x7F58076D766D
#7  0x42F659 in exchangedata_ at solver.f90:1542 (discriminator 1)
#8  0x42EFFB in processgrid_ at solver.f90:431
#9  0x436CF0 in MAIN__ at solver.f90:97

(The other aborting processes print equivalent backtraces, interleaved with this one.)

I tried searching this site for similar errors containing the '(discriminator 1)' part, but couldn't find any. I also searched for cases where MPI produces double-free memory corruption errors, again to no avail.

I must also point out that line 1542 in the error message corresponds to the blocking MPI_SEND call in my code.

The above error popped up when I was using gfortran 4.8.2 with OpenMPI 1.6.5. I also tried running the code with the Intel Fortran compiler and received a curious error message:

[21] trying to free memory block that is currently involved to uncompleted data transfer operation

I searched for the above error on the net and got almost nothing. :( So that was a dead end as well. The full error message is a bit too long, but below is a part of it:

*** glibc detected *** ./a.out: munmap_chunk(): invalid pointer: 0x0000000001c400a0 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x0000000001c40410 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x0000000000a67790 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x0000000000a67790 ***
*** glibc detected *** ./a.out: free(): invalid next size (normal): 0x0000000000d28c80 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x00000000015354b0 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x00000000015354b0 ***
*** glibc detected *** ./a.out: free(): invalid next size (normal): 0x0000000000f51520 ***
[20] trying to free memory block that is currently involved to uncompleted data transfer operation
 free mem  - addr=0x26bd800 len=3966637480
 RTC entry - addr=0x26a9e70 len=148800 cnt=1
Assertion failed in file ../../i_rtc_cache.c at line 1397: 0
internal ABORT - process 20
[21] trying to free memory block that is currently involved to uncompleted data transfer operation
 free mem  - addr=0x951e90 len=2282431520
 RTC entry - addr=0x93e160 len=148752 cnt=1
Assertion failed in file ../../i_rtc_cache.c at line 1397: 0
internal ABORT - process 21

If my error is a careless one or born of insufficient knowledge, the above details might suffice. If it is a deeper issue, I'll gladly elaborate further.

Thanks in advance!

Comments:

You must assume Send/Recv will block. They don't have to. Using Ssend may help you debug the issue since it will always block until matching occurs. Nonblocking is almost always the right way to do boundary exchange. – Jeff Hammond

You might try with a contiguous buffer first. I'm not quite sure how Fortran array slices work. They are supposed to be supported, at least with Fortran 2008 bindings, but they may be buggy. – Jeff Hammond

Non-blocking MPI and non-contiguous arrays is very dangerous in Fortran. MPI3 is still not supported well, especially for gfortran. I recommend MPI derived types and pass just the first element of the buffer. – Vladimir F
Thanks! The comments were illuminating. Looks like I may have placed too much trust in Fortran's array slicing. Since I am not yet familiar with the use of derived MPI datatypes, I'll copy the non-contiguous array into a separate contiguous array and send that for now, and see how it works. – Roshan Sam

1 Answer


Though the question attracted useful comments, I believe an answer describing how those suggestions helped me solve the problem may be useful for anyone who stumbles on this post with the same issue.

As pointed out in the comments, non-blocking MPI calls with non-contiguous array sections in Fortran are a bad idea: the compiler passes a temporary copy of the section to MPI, and that temporary can be freed while the non-blocking transfer is still using it, which is exactly what the Intel runtime's "trying to free memory block that is currently involved to uncompleted data transfer operation" message was complaining about.

I used the idea of copying the non-contiguous data into contiguous arrays and using those instead. With blocking calls, non-contiguous sections behave well, since the temporary copy only needs to survive for the duration of the call. Because I was using blocking MPI_SEND and non-blocking MPI_IRECV, the code now makes explicit copies only on the receiving side and continues to send data non-contiguously as before. This seems to be working for now, but if it can cause any hiccups later on, please warn me in the comments.

It does add a lot of repetitive lines of code (ruining the aesthetics :P). That is mainly because the limits for sending/receiving are not the same for all 6 faces, so the arrays that temporarily store the data to be received have to be allocated (and copied back) individually for each of the six faces.

subroutine exchangeData(dCube)
use someModule

implicit none
integer :: i
integer, dimension(6) :: fRqLst
integer :: stLst(MPI_STATUS_SIZE, 6)
double precision, intent(inout), dimension(xS:xE, yS:yE, zS:zE) :: dCube
!!contiguous buffers to receive each face's data
double precision, allocatable, dimension(:,:,:) :: fx0, fx1, fy0, fy1, fz0, fz1

allocate(fx0(lrArr(1, 1):lrArr(2, 1), lrArr(3, 1):lrArr(4, 1), lrArr(5, 1):lrArr(6, 1)))
allocate(fx1(lrArr(1, 2):lrArr(2, 2), lrArr(3, 2):lrArr(4, 2), lrArr(5, 2):lrArr(6, 2)))
allocate(fy0(lrArr(1, 3):lrArr(2, 3), lrArr(3, 3):lrArr(4, 3), lrArr(5, 3):lrArr(6, 3)))
allocate(fy1(lrArr(1, 4):lrArr(2, 4), lrArr(3, 4):lrArr(4, 4), lrArr(5, 4):lrArr(6, 4)))
allocate(fz0(lrArr(1, 5):lrArr(2, 5), lrArr(3, 5):lrArr(4, 5), lrArr(5, 5):lrArr(6, 5)))
allocate(fz1(lrArr(1, 6):lrArr(2, 6), lrArr(3, 6):lrArr(4, 6), lrArr(5, 6):lrArr(6, 6)))

fRqLst = MPI_REQUEST_NULL

!!post non-blocking receives into the contiguous buffers
call MPI_IRECV(fx0, lrArr(7, 1), MPI_DOUBLE_PRECISION, rArr(1), trArr(1), MPI_COMM_WORLD, fRqLst(1), ierr)
call MPI_IRECV(fx1, lrArr(7, 2), MPI_DOUBLE_PRECISION, rArr(2), trArr(2), MPI_COMM_WORLD, fRqLst(2), ierr)
call MPI_IRECV(fy0, lrArr(7, 3), MPI_DOUBLE_PRECISION, rArr(3), trArr(3), MPI_COMM_WORLD, fRqLst(3), ierr)
call MPI_IRECV(fy1, lrArr(7, 4), MPI_DOUBLE_PRECISION, rArr(4), trArr(4), MPI_COMM_WORLD, fRqLst(4), ierr)
call MPI_IRECV(fz0, lrArr(7, 5), MPI_DOUBLE_PRECISION, rArr(5), trArr(5), MPI_COMM_WORLD, fRqLst(5), ierr)
call MPI_IRECV(fz1, lrArr(7, 6), MPI_DOUBLE_PRECISION, rArr(6), trArr(6), MPI_COMM_WORLD, fRqLst(6), ierr)

!!blocking sends of the (non-contiguous) sections of dCube, as before
do i = 1, 6
    call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
                       lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
end do

call MPI_WAITALL(6, fRqLst, stLst, ierr)

!!copy the received data from the contiguous buffers back into dCube
dCube(lrArr(1, 1):lrArr(2, 1), lrArr(3, 1):lrArr(4, 1), lrArr(5, 1):lrArr(6, 1)) = fx0
dCube(lrArr(1, 2):lrArr(2, 2), lrArr(3, 2):lrArr(4, 2), lrArr(5, 2):lrArr(6, 2)) = fx1
dCube(lrArr(1, 3):lrArr(2, 3), lrArr(3, 3):lrArr(4, 3), lrArr(5, 3):lrArr(6, 3)) = fy0
dCube(lrArr(1, 4):lrArr(2, 4), lrArr(3, 4):lrArr(4, 4), lrArr(5, 4):lrArr(6, 4)) = fy1
dCube(lrArr(1, 5):lrArr(2, 5), lrArr(3, 5):lrArr(4, 5), lrArr(5, 5):lrArr(6, 5)) = fz0
dCube(lrArr(1, 6):lrArr(2, 6), lrArr(3, 6):lrArr(4, 6), lrArr(5, 6):lrArr(6, 6)) = fz1
deallocate(fx0, fx1, fy0, fy1, fz0, fz1)
end subroutine exchangeData

This partially nullifies the advantage I sought by storing the ranks and tags in arrays, which I did mainly so that the sending and receiving calls could be put in loops. With this fix, only the send calls can be put in a loop.

Since allocating and deallocating on every call of the subroutine wastes time, the buffers can instead be put in a module and allocated once at the beginning of the code; the limits do not change between calls.
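For instance, a minimal sketch of that arrangement (the module name and the init routine below are made up; they are not in my code) would be:

module faceBuffers
use someModule    !!assumed to provide lrArr
implicit none
double precision, allocatable, dimension(:,:,:) :: fx0, fx1, fy0, fy1, fz0, fz1

contains

subroutine allocateFaceBuffers()
    !!allocate the six receive buffers once, using the (fixed) receive limits
    allocate(fx0(lrArr(1, 1):lrArr(2, 1), lrArr(3, 1):lrArr(4, 1), lrArr(5, 1):lrArr(6, 1)))
    allocate(fx1(lrArr(1, 2):lrArr(2, 2), lrArr(3, 2):lrArr(4, 2), lrArr(5, 2):lrArr(6, 2)))
    allocate(fy0(lrArr(1, 3):lrArr(2, 3), lrArr(3, 3):lrArr(4, 3), lrArr(5, 3):lrArr(6, 3)))
    allocate(fy1(lrArr(1, 4):lrArr(2, 4), lrArr(3, 4):lrArr(4, 4), lrArr(5, 4):lrArr(6, 4)))
    allocate(fz0(lrArr(1, 5):lrArr(2, 5), lrArr(3, 5):lrArr(4, 5), lrArr(5, 5):lrArr(6, 5)))
    allocate(fz1(lrArr(1, 6):lrArr(2, 6), lrArr(3, 6):lrArr(4, 6), lrArr(5, 6):lrArr(6, 6)))
end subroutine allocateFaceBuffers

end module faceBuffers

allocateFaceBuffers would be called once after lrArr is filled, and exchangeData would then use the module buffers directly and drop its own allocate/deallocate.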

When the same method is applied to the corners and edges as well, it bloats the code a bit, but it seems to be working. :)
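If the bloat ever becomes a problem, the derived-datatype route suggested in the comments should remove the copies entirely: each face can be described once by an MPI subarray type, and the whole (contiguous) dCube is then passed to the communication calls, so no compiler temporaries are involved. Below is only a sketch of what I understand that to look like (untested; faceTypes, buildFaceTypes, sendType and recvType are names invented here, and someModule is assumed to provide the limit arrays, the subdomain bounds and the MPI constants, as in the rest of the code):

module faceTypes
use someModule    !!assumed to provide xS, xE, yS, yE, zS, zE, lsArr, lrArr, ierr and the MPI constants
implicit none
integer, dimension(6) :: sendType, recvType

contains

subroutine buildFaceTypes()
    integer :: i
    integer, dimension(3) :: fullSize, subSize, startIdx

    fullSize = [xE - xS + 1, yE - yS + 1, zE - zS + 1]
    do i = 1, 6
        !!type describing the chunk sent across face i (starts are ZERO-based offsets into dCube)
        subSize  = [lsArr(2, i) - lsArr(1, i) + 1, lsArr(4, i) - lsArr(3, i) + 1, lsArr(6, i) - lsArr(5, i) + 1]
        startIdx = [lsArr(1, i) - xS, lsArr(3, i) - yS, lsArr(5, i) - zS]
        call MPI_TYPE_CREATE_SUBARRAY(3, fullSize, subSize, startIdx, MPI_ORDER_FORTRAN, &
                                      MPI_DOUBLE_PRECISION, sendType(i), ierr)
        call MPI_TYPE_COMMIT(sendType(i), ierr)

        !!type describing the ghost region received across face i
        subSize  = [lrArr(2, i) - lrArr(1, i) + 1, lrArr(4, i) - lrArr(3, i) + 1, lrArr(6, i) - lrArr(5, i) + 1]
        startIdx = [lrArr(1, i) - xS, lrArr(3, i) - yS, lrArr(5, i) - zS]
        call MPI_TYPE_CREATE_SUBARRAY(3, fullSize, subSize, startIdx, MPI_ORDER_FORTRAN, &
                                      MPI_DOUBLE_PRECISION, recvType(i), ierr)
        call MPI_TYPE_COMMIT(recvType(i), ierr)
    end do
end subroutine buildFaceTypes

end module faceTypes

With the types committed once at start-up, the exchange itself would reduce to passing dCube with a count of 1, since dCube is contiguous and no temporary is created:

do i = 1, 6
    call MPI_IRECV(dCube, 1, recvType(i), rArr(i), trArr(i), MPI_COMM_WORLD, fRqLst(i), ierr)
end do
do i = 1, 6
    call MPI_SEND(dCube, 1, sendType(i), rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
end do
call MPI_WAITALL(6, fRqLst, stLst, ierr)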

Thanks for the comments.