All MPI communication calls address entities called ranks. Currently all MPI implementations follow the interpretation that a rank equals an OS process, i.e. entities that do not share memory space. That you can in principle address individual threads by crafting message tags is a workaround (or rather a hack) - e.g. try to implement the analogue of MPI_ANY_TAG
for multiple threads without infinite loops and/or a central process-wide dispatch mechanism.
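To illustrate the kind of contortion involved, here is a minimal sketch of that tag-packing hack (NUM_THREADS, BASE_TAG, dst, dst_tid and tid are illustrative names assumed for this example, not part of any MPI API):

#define NUM_THREADS 4      /* assumed fixed number of threads per process */
#define BASE_TAG    100    /* assumed application-level base tag */

/* sender: address thread dst_tid inside rank dst by packing the
   thread ID into the message tag */
MPI_Send(buf, count, MPI_DOUBLE, dst,
         BASE_TAG*NUM_THREADS + dst_tid, MPI_COMM_WORLD);

/* receiver: thread tid can only match its own packed tag - there is
   no per-thread analogue of MPI_ANY_TAG, hence the need for busy
   polling or a central per-process dispatcher */
MPI_Recv(buf, count, MPI_DOUBLE, MPI_ANY_SOURCE,
         BASE_TAG*NUM_THREADS + tid, MPI_COMM_WORLD, MPI_STATUS_IGNORE);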
There was a proposal in the early stages of MPI-3.0 to include so-called endpoints, which is basically a mechanism that allows a set of ranks to be distributed among a set of threads inside a process. The proposal didn't make it through the voting mechanism and the text has to be refined further.
That said, the following is a plausible way to achieve what you want without reimplementing the collective call with pt2pt operations:
#pragma omp parallel shared(gbuf) private(buf, tid)
{
    tid = omp_get_thread_num();
    ...
    /* each thread copies its local data into its own slot of the global buffer */
    memcpy(gbuf + (rank*num_threads + tid)*data_per_thread,
           buf, data_per_thread*size);
    /* make sure all threads have deposited their data */
    #pragma omp barrier
    /* one thread per process performs the in-place gather-to-all */
    #pragma omp single
    {
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      gbuf, num_threads*data_per_thread, data_type, comm);
    }
    ...
}
It works by first gathering the data from each thread's local buffer buf into the global receive buffer gbuf. In this example a simple memcpy is used, but it may be more involved if a complex derived datatype is being used. Once the local data is properly gathered, an in-place MPI_Allgather is used to collect the pieces from the other processes. The single construct ensures that only one thread per process makes the gather-to-all call.
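As a setup sketch (assuming the usual argc/argv from main): since the single construct may hand the call to any of the threads, MPI should be initialised with at least MPI_THREAD_SERIALIZED for this pattern to be safe:

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
if (provided < MPI_THREAD_SERIALIZED) {
    /* the hybrid pattern above is not guaranteed to work - bail out */
    MPI_Abort(MPI_COMM_WORLD, 1);
}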
If the number of threads is not the same in all processes, MPI_Allgatherv
should be used instead.
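A rough sketch of that variant, reusing the variables from the snippet above (rank, num_threads, data_per_thread, gbuf, data_type, comm) and introducing recvcounts/displs as illustrative names (it also needs <stdlib.h> for malloc), would first exchange the per-process thread counts:

int nprocs;
MPI_Comm_size(comm, &nprocs);

/* exchange the number of threads each process contributes */
int *recvcounts = malloc(nprocs * sizeof(int));
MPI_Allgather(&num_threads, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

/* convert to element counts and compute displacements */
int *displs = malloc(nprocs * sizeof(int));
int offset = 0;
for (int i = 0; i < nprocs; i++) {
    recvcounts[i] *= data_per_thread;
    displs[i] = offset;
    offset += recvcounts[i];
}

/* with MPI_IN_PLACE each thread must now memcpy its data to
   gbuf + displs[rank] + tid*data_per_thread instead of the
   fixed-stride offset used above */
MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
               gbuf, recvcounts, displs, data_type, comm);

free(recvcounts);
free(displs);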