1
votes

I receive the following sequence of errors when I try to run a problem on four processors. The MPI command I use is mpirun -np 4.

I apologize for posting the error message as is (primarily due to a lack of knowledge on how to decipher the information given). I would appreciate your input on the following:

  1. What does the error message mean? At what point does one receive it? Is it caused by system memory (a hardware limit), or is it a communication error (something related to MPI_Isend/MPI_Irecv, i.e. a software issue)?

  2. Finally, how do I fix this?

Thanks!

The error message I receive follows below. *PLEASE NOTE: this error occurs only when the simulation time is large.* The code computes fine when the time required is small (e.g., 300 time steps rather than 1000 time steps).

aborting job:

Fatal error in MPI_Irecv: Other MPI error, error stack:

MPI_Irecv(143): MPI_Irecv(buf=0x8294a60, count=48, MPI_DOUBLE, src=2, tag=-1, MPI_COMM_WORLD, request=0xffffd68c) failed

MPID_Irecv(64): Out of memory

aborting job:

Fatal error in MPI_Irecv: Other MPI error, error stack:

MPI_Irecv(143): MPI_Irecv(buf=0x8295080, count=48, MPI_DOUBLE, src=3, tag=-1, MPI_COMM_WORLD, request=0xffffd690) failed

MPID_Irecv(64): Out of memory

aborting job: Fatal error in MPI_Isend: Internal MPI error!, error stack:

MPI_Isend(142): MPI_Isend(buf=0x8295208, count=48, MPI_DOUBLE, dest=3, tag=0, MPI_COMM_WORLD, request=0xffffd678) failed

(unknown)(): Internal MPI error!

aborting job: Fatal error in MPI_Irecv: Other MPI error, error stack:

MPI_Irecv(143): MPI_Irecv(buf=0x82959b0, count=48, MPI_DOUBLE, src=2, tag=-1, MPI_COMM_WORLD, request=0xffffd678) failed

MPID_Irecv(64): Out of memory

rank 3 in job 1 myocyte80_37021 caused collective abort of all ranks exit status of rank 3: return code 13

rank 1 in job 1 myocyte80_37021 caused collective abort of all ranks exit status of rank 1: return code 13

EDIT: (SOURCE CODE)

Header files
Variable declaration
TOTAL_TIME = 
...
...
double *A = new double[Rows];
double *AA = new double[Rows];
double *B = new double[Rows];
double *BB = new double[Rows];
....
....
int Rmpi;
int my_rank;
int p;
int source; 
int dest;
int tag = 0;
function declaration

int main (int argc, char *argv[])
{
MPI_Status status[8]; 
MPI_Request request[8];
MPI_Init (&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &p);   
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

//PROBLEM SPECIFIC PROPERTIES. VARY BASED ON NODE 
if (Flag == 1)
{
if (my_rank == 0)
{
Defining boundary (start/stop) for special elements in tissue (Rows x Column)
}
if (my_rank == 2)
..
if (my_rank == 3)
..
if (my_rank == 4)
..
}

//INITIAL CONDITIONS ALSO VARY BASED ON NODE
for (Columns = 0; Columns < 48; Columns++) // Normal Direction
{
for (Rows = 0; Rows < 48; Rows++)  //Transverse Direction
{
if (Flag == 1)
{
if (my_rank == 0)
{
Initial conditions for elements
}
if (my_rank == 1) //MPI
{
}
..
..
..
//SIMULATION START

while (t[0][0] < TOTAL_TIME)
{       
for (Columns = 0; Columns < 48; Columns++) //Normal Direction
{
for (Rows = 0; Rows < 48; Rows++) //Transverse Direction
{
//SOME MORE PROPERTIES BASED ON NODE
if (my_rank == 0)
{
if (FLAG == 1)
{
Condition 1
}   
 else
{
Condition 2 
}
}

if (my_rank == 1)
....
 ....
  ...

//Evaluate functions (differential equations)
Function 1 ();
Function 2 ();
...
...

//Based on the output of the differential equations, different nodes estimate variable values.
//Since the problem is nearest-neighbor, corners and edges have different neighbors/boundary
//conditions.
if (my_rank == 0)
{
if (Row/Column at bottom_left)
{
Variables =
}

if (Row/Column at Bottom Right) 
{
Variables =
}
}
...
 ...

 //Keeping track of time for each element in Row and Column. Time is updated for a
 //certain element.
 t[Column][Row] = t[Column][Row] + dt;

  }
  }//END OF ROWS AND COLUMNS

 // MPI IMPLEMENTATION. AT END OF EVERY TIME STEP, Nodes communicate with nearest neighbor
 //First step is to populate arrays with values estimated above
 for (Columns = 0; Columns < 48; Columns++) 
 {
 for (Rows = 0; Rows < 48; Rows++) 
 {
 if (my_rank == 0)
 {
 //Loading the edges of the (Row x Column) grid into variables. This one-dimensional array
 //data is shared with the nearest neighbor for computation at the next time step.

 if (Column == 47)
 {
 A[i] = V[Column][Row]; 
 …
 }
 if (Row == 47)
 {
 B[i] = V[Column][Row]; 
 }
 }

...
...                 

 //NON BLOCKING MPI SEND RECV TO SHARE DATA WITH NEAREST NEIGHBOR

 if ((my_rank) == 0)
 {
 MPI_Isend(A, Rows, MPI_DOUBLE, my_rank+1, 0, MPI_COMM_WORLD, &request[1]);
 MPI_Irecv(AA, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[3]);
 MPI_Wait(&request[3], &status[3]);  
 MPI_Isend(B, Rows, MPI_DOUBLE, my_rank+2, 0, MPI_COMM_WORLD, &request[5]);
 MPI_Irecv(BB, Rows, MPI_DOUBLE, my_rank+2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[7]);
 MPI_Wait(&request[7], &status[7]);
 }

if ((my_rank) == 1)
{
MPI_Irecv(CC, Rows, MPI_DOUBLE, my_rank-1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[1]);
MPI_Wait(&request[1], &status[1]); 
MPI_Isend(Cmpi, Rows, MPI_DOUBLE, my_rank-1, 0, MPI_COMM_WORLD, &request[3]);

MPI_Isend(D, Rows, MPI_DOUBLE, my_rank+2, 0, MPI_COMM_WORLD, &request[6]); 
MPI_Irecv(DD, Rows, MPI_DOUBLE, my_rank+2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[8]);
MPI_Wait(&request[8], &status[8]);
}

if ((my_rank) == 2)
{
MPI_Isend(E, Rows, MPI_DOUBLE, my_rank+1, 0, MPI_COMM_WORLD, &request[2]);
MPI_Irecv(EE, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[4]);
MPI_Wait(&request[4], &status[4]);

MPI_Irecv(FF, Rows, MPI_DOUBLE, my_rank-2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[5]);
MPI_Wait(&request[5], &status[5]);
MPI_Isend(Fmpi, Rows, MPI_DOUBLE, my_rank-2, 0, MPI_COMM_WORLD, &request[7]);
}

if ((my_rank) == 3)
{
MPI_Irecv(GG, Rows, MPI_DOUBLE, my_rank-1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[2]);
MPI_Wait(&request[2], &status[2]);
MPI_Isend(G, Rows, MPI_DOUBLE, my_rank-1, 0, MPI_COMM_WORLD, &request[4]);

MPI_Irecv(HH, Rows, MPI_DOUBLE, my_rank-2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[6]);
MPI_Wait(&request[6], &status[6]); 
MPI_Isend(H, Rows, MPI_DOUBLE, my_rank-2, 0, MPI_COMM_WORLD, &request[8]);
}

 //RELOADING Data (from MPI_IRecv array to array used to compute at next time step)
 for (Columns = 0; Columns < 48; Columns++) 
 {
 for (Rows = 0; Rows < 48; Rows++) 
 {
 if (my_rank == 0)
 {
 if (Column == 47)
 {
 V[Column][Row]= A[i];
 }
 if (Row == 47)
 {
 V[Column][Row]=B[i];
 }
  }

  ….
 //PRINT TO OUTPUT FILE AT CERTAIN POINT
 printval = 100; 
 if ((printdata>=printval))
 {
 prttofile ();
 printdata = 0;
 }
 printdata = printdata+1;
 compute_dt (); 

 }//CLOSE ALL TIME STEPS

 MPI_Finalize ();

  }//CLOSE MAIN
"Out of memory" seems as clear as it gets to me. What are you asking exactly?ildjarn
No one can accurately tell you why you're getting this error with this information alone. This question should be clarified to better describe the problem, perhaps with a code snippet. As it is, this will probably be closed as not a real question. – AJG85
Is this running on a cluster or a single machine? – Adam
Have you computed how much memory your problem requires, or just monitored it with top? Like @ildjarn said, out of memory is pretty clear. Especially since you say it happens at larger problem sizes, it's possible you've got a memory leak at each time step that simply compounds until you run out. – Adam
Adam: This is running on a cluster. I use the following command to submit a job: mpirun -np 4 <jobname>. I assumed it was running on 4 nodes. In our setup, each node has 4 processors. However, when I use "top" I notice that each of the 4 processors is running at 99% utilization. My questions are the following: 1. What do you mean by a memory leak? How do I go about this problem if it is a memory issue (i.e. hardware-based, due to insufficient memory)? Basically, can I set up MPI in some way to fix this? 2. Given the way I submit an MPI run, how do I submit so that each node computes only 1 thread? – Ashmohan

2 Answers

5
votes

Are you repeatedly calling MPI_Irecv? If so, you may not realize that each call allocates a request handle; these are freed only when the message is received and the request is completed with (e.g.) MPI_Test or MPI_Wait. It is possible to exhaust memory, or the memory an MPI implementation sets aside for this purpose, through over-use of MPI_Irecv.

Only seeing the code would confirm the problem.
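
To make that concrete, here is a minimal sketch (not the poster's code; the two-rank pairing and buffer contents are assumptions for illustration): every MPI_Isend/MPI_Irecv returns a request handle, and that handle, along with whatever the implementation allocated for it internally, is only released once the request is completed with MPI_Wait, MPI_Waitall, or a successful MPI_Test.

#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 48;  // same count as in the error stack above
    std::vector<double> sendbuf(N, rank), recvbuf(N, 0.0);

    if (size >= 2 && rank < 2)
    {
        int peer = 1 - rank;  // ranks 0 and 1 exchange with each other
        MPI_Request reqs[2];
        MPI_Irecv(recvbuf.data(), N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf.data(), N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        // Completing BOTH requests releases both handles. Leaving either one
        // pending on every iteration of a loop is what eventually exhausts the
        // memory the implementation sets aside for requests.
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    MPI_Finalize();
    return 0;
}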

0
votes

Now that the code has been added to the question: this is indeed dirty code. You only wait for the request from the Irecv call. Yes, once the message has been received you know that the matching send has completed, so you might think you don't have to wait for it. But skipping the wait causes a memory leak: each Isend allocates a new request object, which the corresponding Wait would deallocate. Since you never wait on the send requests, they are never deallocated, and you have a memory leak.
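
As a sketch of what the fix could look like for the rank-0 exchange in the posted code (A, AA, B, BB, Rows and the neighbor ranks are taken from the question; the rest of the loop is unchanged): collect every request, including the ones returned by MPI_Isend, and complete them all.

if (my_rank == 0)
{
    MPI_Request reqs[4];
    MPI_Status  stats[4];

    MPI_Isend(A,  Rows, MPI_DOUBLE, my_rank + 1, 0,           MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(AA, Rows, MPI_DOUBLE, my_rank + 1, MPI_ANY_TAG, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(B,  Rows, MPI_DOUBLE, my_rank + 2, 0,           MPI_COMM_WORLD, &reqs[2]);
    MPI_Irecv(BB, Rows, MPI_DOUBLE, my_rank + 2, MPI_ANY_TAG, MPI_COMM_WORLD, &reqs[3]);

    // Waiting on all four requests frees the send requests as well, so no
    // handles accumulate from one time step to the next.
    MPI_Waitall(4, reqs, stats);
}

The same change applies to the blocks for ranks 1, 2 and 3. Alternatively, MPI_Request_free can be called on a send request you genuinely never want to wait on, but completing every request with a Wait or Test is the simpler and safer option here.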