
I wish to discover the cause of an error in an MPI program. The program is a big while loop; in each iteration, a set of messages is exchanged between each processor and its neighbours using MPI_ISEND and MPI_IRECV, as follows:

while ( t < a very large number ) ...

do i=1,8
 if ( something that is almost always true ) then
   call MPI_ISEND(A,A_buffer,inewtype,neighrank(i),2,MPI_COMM_WORLD,isend,ierr)
   call MPI_WAIT(isend,istatus,ierr)
   call MPI_ISEND(B,B_buffer,MPI_INTEGER4,neighrank(i),3,MPI_COMM_WORLD,isend,ierr)
   call MPI_WAIT(isend,istatus,ierr)
 end if
end do

do i=1,8
 if ( something that is almost always true) then
   call MPI_IRECV(C,C_buffer,inewtype,neighrank(i),2,MPI_COMM_WORLD,irecv,ierr)
   call MPI_WAIT(irecv,istatus,ierr)
   call MPI_IRECV(D,D_buffer,MPI_INTEGER4,neighrank(i),3,MPI_COMM_WORLD,irecv,ierr)
   call MPI_WAIT(irecv,istatus,ierr)
 end if
end do
...

The program produces a segmentation fault after a very large number of iterations. At each iteration the same amount of data is passed among the processors, but the number of calls to ISEND and IRECV is adjustable (i.e. use 80 calls to pass 80 kB total or 40 calls to pass 160 kB total). If the number of calls is small, the program crashes earlier.

I suspect that something about InfiniBand is causing this error, but I do not get an "insufficient virtual memory" error, so perhaps it is not InfiniBand after all? What could possibly cause this error?

Why on Earth do you use a combination of MPI_I(SEND/RECV) immediately followed by MPI_WAIT when a simple MPI_SEND/RECV would do exactly the same? I would also recommend that you compile your program with debugging enabled and then examine the core file to find where the crash occurs. – Hristo Iliev
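For reference, a minimal sketch of the blocking form suggested in this comment, assuming the same buffers, counts, committed derived type inewtype and neighrank array as in the question, with istatus declared of size MPI_STATUS_SIZE:

do i=1,8
 if ( something that is almost always true ) then
   ! MPI_SEND returns once the send buffer may be reused, which is exactly
   ! what MPI_ISEND immediately followed by MPI_WAIT guarantees.
   call MPI_SEND(A,A_buffer,inewtype,neighrank(i),2,MPI_COMM_WORLD,ierr)
   call MPI_SEND(B,B_buffer,MPI_INTEGER4,neighrank(i),3,MPI_COMM_WORLD,ierr)
 end if
end do

do i=1,8
 if ( something that is almost always true ) then
   call MPI_RECV(C,C_buffer,inewtype,neighrank(i),2,MPI_COMM_WORLD,istatus,ierr)
   call MPI_RECV(D,D_buffer,MPI_INTEGER4,neighrank(i),3,MPI_COMM_WORLD,istatus,ierr)
 end if
end do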
Thanks; I read that MPI_I(SEND/RECV) are typically faster. – Pippi
MPI_ISEND and MPI_IRECV belong to the class of non-blocking communication operations that execute in the background. They allow you to do computations while the communication takes place (e.g. between MPI_ISEND and MPI_WAIT), which often leads to faster overall program execution. But the operations themselves are as fast as their blocking counterparts. – Hristo Iliev
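A minimal sketch of the overlap described in this comment, reusing the names from the question; the work placed between MPI_ISEND and MPI_WAIT is a placeholder for any computation that does not touch the send buffer A:

! Start the send in the background.
call MPI_ISEND(A,A_buffer,inewtype,neighrank(i),2,MPI_COMM_WORLD,isend,ierr)

! ... computation that neither reads nor writes A can run here ...

! A may only be modified or reused after the wait completes.
call MPI_WAIT(isend,istatus,ierr)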
It's my understanding that MPI_ISEND/RECV also helps prevent deadlock that can occur with buffer overflow when using the regular MPI_SEND/RECV. – bob.sacamento
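For illustration, a hedged sketch of that point: if every rank first calls a blocking MPI_SEND to its partner and the messages are too large to be buffered internally, all ranks can block inside MPI_SEND; pre-posting a non-blocking receive avoids this. Here partner, sendbuf, recvbuf and n are placeholder names, not from the question:

! Potentially deadlocking exchange: both ranks can block in MPI_SEND
! when the message exceeds the internal buffering limit.
!   call MPI_SEND(sendbuf,n,MPI_REAL8,partner,0,MPI_COMM_WORLD,ierr)
!   call MPI_RECV(recvbuf,n,MPI_REAL8,partner,0,MPI_COMM_WORLD,istatus,ierr)

! Safer variant: pre-post the receive, then send, then wait.
call MPI_IRECV(recvbuf,n,MPI_REAL8,partner,0,MPI_COMM_WORLD,irecv,ierr)
call MPI_SEND(sendbuf,n,MPI_REAL8,partner,0,MPI_COMM_WORLD,ierr)
call MPI_WAIT(irecv,istatus,ierr)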

1 Answer


The MPI code turned out to be fine. It was hard to tell because the program takes 1-2 hours to run before hitting the segmentation fault. Rigorous debugging pointed to a non-MPI-related bug.