6 votes

I'm new to MPI. I have 4 processes: processes 1 through 3 populate a vector and send it to process 0, and process 0 collects the vectors into one very long vector. I have code that works (too long to post), but process 0's recv operation is clumsy and very slow.

In abstract, the code does the following:

MPI::Init();
int id = MPI::COMM_WORLD.Get_rank();

if(id>0) {
    double* my_array = new double[n*m]; //n,m are int
    Populate(my_array, id);
    MPI::COMM_WORLD.Send(my_array,n*m,MPI::DOUBLE,0,50);
}

if(id==0) {
    double* all_arrays = new double[3*n*m];
    /* Slow Code Starts Here */
    double startcomm = MPI::Wtime();
    for (int i=1; i<=3; i++) {
        MPI::COMM_WORLD.Recv(&all_arrays[(i-1)*m*n],n*m,MPI::DOUBLE,i,50);
    }
    double endcomm = MPI::Wtime();
    //Process 0 has more operations...
}
MPI::Finalize();

It turns out that endcomm - startcomm accounts for 50% of the total time (0.7 seconds compared to 1.5 seconds for the program to complete).

Is there a better way to receive the vectors from processes 1-3 and store them in process 0's all_arrays?

I checked out MPI::Comm::Gather, but I'm not sure how to use it. In particular, will it allow me to specify that process 1's array is the first array in all_arrays, process 2's array the second, etc.? Thanks.

Edit: I removed the "slow" loop, and instead put the following between the "if" blocks:

MPI_Gather(my_array,n*m,MPI_DOUBLE,
    &all_arrays[(id-1)*m*n],n*m,MPI_DOUBLE,0,MPI_COMM_WORLD);

The same slow performance resulted. Does this have something to do with the fact that the root process "waits" for each individual receive to complete before attempting the next one? Or is that not the right way to think about it?

Just a quick comment - the C++ bindings are deprecated in the current MPI standard version 2.2 and will be completely removed in the upcoming MPI 3.0. It is recommended that you learn and use the C interface instead for portability reasons. – Hristo Iliev
How large are n and m in your program, and what kind of connection is there between your machines? – suszterpatt
Hristo, thanks -- I changed my code to the C interface. – covstat
@suszterpatt: n=3500, m=7, so n*m=24,500. I don't know how to determine the connection. If it helps, right now I'm testing this code on my laptop, a MacBook Pro with a 2.2 GHz Core i7, but will soon run it on a Sun Grid Engine cluster. – covstat
@covstat There should be no connection issue since you're on shared memory. What performance do you expect to see? Whatever MPI operation you use, this kind of data gathering is always slower than an exchange between, say, nearest neighbors. I am not familiar with Macs, so it could be an OS/compiler/MPI build issue as well, which is beyond my knowledge. – milancurcic

1 Answer

6 votes

Yes, MPI_Gather will do exactly that. From the ANL man page for MPI_Gather:

int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype, 
               void *recvbuf, int recvcnt, MPI_Datatype recvtype, 
               int root, MPI_Comm comm)

Here, sendbuf is your array on each process (my_array), and recvbuf is the long array (all_arrays) on the receiving (root) process into which the short arrays are gathered. Each process's short array is copied into its own contiguous slot in the long array, in rank order, so you don't need to place the blocks yourself: rank 0's block comes first, rank 1's immediately follows it, and so on.
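For concreteness, here is a minimal sketch of the call with the C interface, assuming every rank, including the root, fills an n*m block (the case where the root contributes nothing is covered in the edit below). The names and sizes mirror the question and the comments, so treat them as illustrative:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 3500, m = 7;                  /* illustrative sizes */
    double *my_array = malloc(n * m * sizeof(double));
    /* ... every rank fills my_array here ... */

    double *all_arrays = NULL;
    if (rank == 0)                              /* recvbuf only matters at the root */
        all_arrays = malloc((size_t)size * n * m * sizeof(double));

    /* Rank i's block lands at all_arrays[i*n*m]; blocks are stored in rank order. */
    MPI_Gather(my_array, n * m, MPI_DOUBLE,
               all_arrays, n * m, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    free(my_array);
    free(all_arrays);
    MPI_Finalize();
    return 0;
}

Note that recvcnt is the number of elements received from each rank (n*m here), not the total length of all_arrays, and that recvbuf is only significant on the root.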

EDIT:

In the case where the receiving process does not contribute its own block to the gather (as in your code, where only ranks 1-3 have data), you may want to use MPI_Gatherv instead, which lets you specify a per-rank count and displacement (thanks to @HristoIliev for pointing this out).
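A minimal sketch of that variant, assuming the root passes a send count of zero and only ranks 1 through size-1 contribute n*m doubles; recvcounts and displs are illustrative names for the per-rank counts and offsets that only the root needs to set up:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 3500, m = 7;                  /* illustrative sizes */
    int sendcount = (rank == 0) ? 0 : n * m;    /* the root contributes nothing */

    double *my_array = NULL;
    if (rank > 0) {
        my_array = malloc(n * m * sizeof(double));
        /* ... populate my_array ... */
    }

    double *all_arrays = NULL;
    int *recvcounts = NULL, *displs = NULL;
    if (rank == 0) {
        all_arrays = malloc((size_t)(size - 1) * n * m * sizeof(double));
        recvcounts = malloc(size * sizeof(int));
        displs     = malloc(size * sizeof(int));
        recvcounts[0] = 0;                      /* nothing from the root */
        displs[0]     = 0;
        for (int i = 1; i < size; i++) {
            recvcounts[i] = n * m;
            displs[i]     = (i - 1) * n * m;    /* rank i goes into block i-1 */
        }
    }

    /* sendbuf is not read on the root because its sendcount is 0 */
    MPI_Gatherv(my_array, sendcount, MPI_DOUBLE,
                all_arrays, recvcounts, displs, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    free(my_array);
    free(all_arrays);
    free(recvcounts);
    free(displs);
    MPI_Finalize();
    return 0;
}

The displacements are in units of the receive datatype, so (i-1)*n*m places rank i's block right after rank i-1's, which matches the layout your receive loop was building.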