
I have an MPI program that runs on a cluster of machines. However, the program does not run to completion and I am unable to identify the reason. The main function consists of two clauses (an if clause and an else clause):

#define SERVER 0

if(my_rank == SERVER)
{
   //do something
}
else
{
   //do something else
}

The problem seems to be in the "do something else" part and I would like to debug it with gdb. When I run the executable under gdb, I can only step into the if clause, because MPI assigns rank 0 to the main process (the one that launches the program). I looked into environment variables but haven't found a flag to pre-determine the rank of the main process. How can I debug the else clause?

Which MPI (or MPI-2) implementation are you using? – Henrik
The answers to this question -- stackoverflow.com/questions/329259/… -- may be of assistance to you. – High Performance Mark
I'm using version 3.0.4. – NewToAndroid
Thanks for the stackoverflow link. I found this particular post helpful: "I use this little homebrewed method to attach a debugger to MPI processes: call the following function, DebugWait(), right after MPI_Init() in your code. While the processes are waiting for keyboard input, you have all the time you need to attach the debugger to them and add breakpoints. When you are done, provide a single character of input: static void DebugWait(int rank) { char a; if(rank == 0) { scanf("%c", &a); } MPI_Bcast(&a, 1, MPI_BYTE, 0, MPI_COMM_WORLD); }" – NewToAndroid
The above method does not solve my problem though. I called the function DebugWait() right after MPI_Init(). While the program was waiting, I attached gdb to one of the processes and set a breakpoint at the line number of the "else". Then I provided a character of input to resume the program. When I call "next", gdb prints "Single stepping until exit from function MPID_nem_tcp_connpoll, which has no line number information. 0x000000000042e32a in MPIDI_CH3I_Progress ()" – NewToAndroid
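
For reference, here is a self-contained sketch of the DebugWait approach described in the comments above. The PID printout and the skeleton main() are illustrative additions, not from the original post, and the comment in main() stands in for the real program logic:

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

/* Block every rank until rank 0 reads a character from stdin.
   While the ranks are blocked, attach gdb to the process you care
   about (gdb -p <pid>), set breakpoints, then type a character
   followed by Enter in the terminal running mpiexec. */
static void DebugWait(int rank)
{
    char a;
    if (rank == 0) {
        scanf("%c", &a);
    }
    MPI_Bcast(&a, 1, MPI_BYTE, 0, MPI_COMM_WORLD);
}

int main(int argc, char **argv)
{
    int my_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Each rank reports its PID so you know which process to attach to. */
    printf("rank %d has PID %d\n", my_rank, (int)getpid());
    fflush(stdout);

    DebugWait(my_rank);

    /* ... the if (my_rank == SERVER) { ... } else { ... } logic goes here ... */

    MPI_Finalize();
    return 0;
}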

1 Answer


It's very hard to tell what's going on without seeing the code (only post code once you've cut it down to a minimal working example, an MWE), but usually when you get hung up in the progress engine inside MPICH, it's because your message matching is incorrect. My guess, based on what you've put in the comments, is that you aren't calling MPI_INIT on all ranks. Make sure that you are, and that all of your send/receive calls match up (and your collectives); a sketch of what correct matching looks like is below. If that still doesn't work, cut your code down to the MWE and post that here.
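
As a rough illustration (a sketch with placeholder work, not the asker's actual code), here is a skeleton in which every rank calls MPI_Init and each worker's send is matched by a receive on the server:

#include <stdio.h>
#include <mpi.h>

#define SERVER 0

int main(int argc, char **argv)
{
    int my_rank, nprocs;
    MPI_Init(&argc, &argv);            /* must execute on every rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (my_rank == SERVER) {
        /* Exactly one receive per worker, matching each worker's send
           (same count, datatype, and tag). */
        for (int src = 1; src < nprocs; src++) {
            int result;
            MPI_Recv(&result, 1, MPI_INT, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("got %d from rank %d\n", result, src);
        }
    } else {
        int result = my_rank * my_rank;  /* placeholder work */
        MPI_Send(&result, 1, MPI_INT, SERVER, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}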