Dealing with complex send recv message within a for loop

Question

I am trying to parallelise a biological model in C++ with boost::mpi. It is my first attempt, and I am entirely new to the boost library (I have started from the Boost C++ Libraries book by Schaling). The model consists of grid cells and cohorts of individuals living within each grid cell. The classes are nested, such that a vector of Cohorts* belongs to a GridCell. The model runs for 1000 years, and at each time step, there is dispersal such that the cohorts of individuals move randomly between grid cells. I want to parallelise the content of the for loop, but not the loop itself as each time step depends on the state of the previous time.

I use world.send() and world.recv() to send the necessary information from one rank to another. Because sometimes there is nothing to send between ranks I use with mpi::status and world.iprobe() to make sure the code does not hang waiting for a message that was never sent (I followed this tutorial)

The first part of my code seems to work fine but I am having troubles with making sure all the sent messages have been received before moving on to the next step in the for loop. In fact, I noticed that some ranks move on to the following time step before the other ranks have had the time to send their messaages (or at least that what it looks like from the output)

I am not posting the code because it consists of several classes and it’s quite long. If interested the code is on github. I write here roughly the pseudocode. I hope this will be enough to understand the problem.

int main()
{
    // initialise the GridCells and Cohorts living in them

    //depending on the number of cores requested split the 
    //grid cells that are processed by each core evenly, and 
    //store the relevant grid cells in a vector of  GridCell*

    // start to loop through each time step
    for (int k = 0; k < (burnIn+simTime); k++) 
    {
        // calculate the survival and reproduction probabilities 
        // for each Cohort and the dispersal probability

        // the dispersing Cohorts are sorted based on the rank of
        // the destination and stored in multiple vector<Cohort*>

        // I send the vector<Cohort*> with 
        world.send(…)

        // the receiving rank gets the vector of Cohorts with: 
        mpi::status statuses[world.size()];
        for(int st = 0; st < world.size(); st++)
        {
            ....
            if( world.iprobe(st, tagrec) )    
            statuses[st] = world.recv(st, tagrec, toreceive[st]);
            //world.iprobe ensures that the code doesn't hang when there
            // are no dispersers
        }
        // do some extra calculations here

        //wait that all processes are received, and then the time step ends. 
        //This is the bit where I am stuck. 
        //I've seen examples with wait_all for the non-blocking isend/irecv,
        // but I don't think it is applicable in my case.
        //The problem is that I noticed that some ranks proceed to the next
        //time step before all the other ranks have sent their messages.
    }
}

I compile with

mpic++ -I/$HOME/boost_1_61_0/boost/mpi -std=c++11  -Llibdir \-lboost_mpi -lboost_serialization -lboost_locale  -o out

and execute with mpirun -np 5 out, but I would like to be able to execute with a higher number of cores on an HPC cluster later on (the model will be run at the global scale, and the number of cells might depend on the grid cell size chosen by the user). The compilers installed are g++ (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0, Open MPI: 2.1.1

Zulan Zulan · Accepted Answer · 2018-10-10T08:11:21

The fact that you have nothing to send is an important piece of information in your scenario. You can not deduce that fact from only the absence of a message. The absence of a message only means nothing was sent yet.

Simply sending a zero-sized vector and skipping the probing is the easiest way out.

Otherwise you would probably have to change your approach radically or implement a very complex speculative execution / rollback mechanism.

Also note that the linked tutorial uses probe in a very different fashion.

Dealing with complex send recv message within a for loop

1 Answers