All MPI communication calls address entities called ranks. Currently all MPI implementations follow the interpretation that a rank equals an OS process, i.e. entities that do not share memory space. That you can in principle address individual threads by crafting message tags is a workaround (or rather a hack) - e.g. try to implement the analogue of MPI_ANY_TAG
for multiple threads without infinite loops and/or a central process-wide dispatch mechanism.
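To illustrate the kind of contortion involved, here is a minimal sketch of that tag-packing hack (NUM_THREADS, BASE_TAG, dst, dst_tid and tid are illustrative names assumed for this example, not part of any MPI API):

#define NUM_THREADS 4      /* assumed fixed number of threads per process */
#define BASE_TAG    100    /* assumed application-level base tag */

/* sender: address thread dst_tid inside rank dst by packing the
   thread ID into the message tag */
MPI_Send(buf, count, MPI_DOUBLE, dst,
         BASE_TAG*NUM_THREADS + dst_tid, MPI_COMM_WORLD);

/* receiver: thread tid can only match its own packed tag - there is
   no per-thread analogue of MPI_ANY_TAG, hence the need for busy
   polling or a central per-process dispatcher */
MPI_Recv(buf, count, MPI_DOUBLE, MPI_ANY_SOURCE,
         BASE_TAG*NUM_THREADS + tid, MPI_COMM_WORLD, MPI_STATUS_IGNORE);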
There was a proposal in the early stages of MPI-3.0 to include so-called endpoints, which is basically a mechanism that allows a set of ranks to be distributed among a set of threads inside a process. The proposal didn't make it through the voting mechanism and the text has to be refined further.
That said, the following is a plausible way to achieve what you want without reimplementing the collective call with pt2pt operations:
#pragma omp parallel shared(gbuf) private(buf, tid)
{
    tid = omp_get_thread_num();
    ...
    /* each thread copies its local data into its own slot of the global buffer */
    memcpy(gbuf + (rank*num_threads + tid)*data_per_thread,
           buf, data_per_thread*size);
    /* make sure all threads have deposited their data */
    #pragma omp barrier
    /* one thread per process performs the in-place gather-to-all */
    #pragma omp single
    {
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      gbuf, num_threads*data_per_thread, data_type, comm);
    }
    ...
}
It works by first gathering the data from each thread's local buffer buf into the global receive buffer gbuf. In this example a simple memcpy is used, but it may be more involved if a complex derived datatype is being used. Once the local data is properly gathered, an in-place MPI_Allgather is used to collect the pieces from the other processes. The single construct ensures that only one thread per process makes the gather-to-all call.
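As a setup sketch (assuming the usual argc/argv from main): since the single construct may hand the call to any of the threads, MPI should be initialised with at least MPI_THREAD_SERIALIZED for this pattern to be safe:

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
if (provided < MPI_THREAD_SERIALIZED) {
    /* the hybrid pattern above is not guaranteed to work - bail out */
    MPI_Abort(MPI_COMM_WORLD, 1);
}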
If the number of threads is not the same in all processes, MPI_Allgatherv
should be used instead.
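A rough sketch of that variant, reusing the variables from the snippet above (rank, num_threads, data_per_thread, gbuf, data_type, comm) and introducing recvcounts/displs as illustrative names (it also needs <stdlib.h> for malloc), would first exchange the per-process thread counts:

int nprocs;
MPI_Comm_size(comm, &nprocs);

/* exchange the number of threads each process contributes */
int *recvcounts = malloc(nprocs * sizeof(int));
MPI_Allgather(&num_threads, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

/* convert to element counts and compute displacements */
int *displs = malloc(nprocs * sizeof(int));
int offset = 0;
for (int i = 0; i < nprocs; i++) {
    recvcounts[i] *= data_per_thread;
    displs[i] = offset;
    offset += recvcounts[i];
}

/* with MPI_IN_PLACE each thread must now memcpy its data to
   gbuf + displs[rank] + tid*data_per_thread instead of the
   fixed-stride offset used above */
MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
               gbuf, recvcounts, displs, data_type, comm);

free(recvcounts);
free(displs);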