
In a multithreaded MPI environment, MPI calls must be protected with a mutex (or another thread-locking mechanism) when MPI is initialized via MPI_Init_thread with MPI_THREAD_SERIALIZED (see this answer). This is not required with MPI_THREAD_MULTIPLE, but that level is not supported by all MPI implementations.
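For context, a minimal sketch of that initialization step: the requested level is only a request, so the returned provided value should be checked before any thread makes MPI calls (the error handling here is my own illustration, not from the original code):

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
    // The implementation may grant less than requested, so check the
    // returned level before spawning threads that will call MPI.
    if (provided < MPI_THREAD_SERIALIZED) {
        std::fprintf(stderr, "MPI_THREAD_SERIALIZED not available (provided=%d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Finalize();
    return 0;
}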

My question is whether a lock is strictly required for some MPI functions, specifically for MPI_Test, MPI_Wait and MPI_Get_count. I know the lock is required for all MPI calls "with communication" (such as MPI_Gather, MPI_Bcast, MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv, etc.), but I suspect it is not required for other functions, such as MPI_Get_count, which is a local function. I need to know whether this lock is required for functions like MPI_Test, MPI_Wait, MPI_Get_count, MPI_Probe and MPI_Iprobe (I do not know which of these are local functions and which are not). Is this lock dependency defined by the MPI standard, or is it implementation-defined?

I am developing a parallelization library that mixes non-blocking MPI calls with C++11 threads, and I need to use MPI_THREAD_SERIALIZED to support as many MPI implementations as possible. The library also implements MPI_THREAD_MULTIPLE (which gives better performance in most cases), but MPI_THREAD_SERIALIZED support is required as well; a sketch of one way to switch between the two is shown below.
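As an illustration only (mpi_guard is a hypothetical helper I made up, not part of the library), such a library could select the locking behavior at run time from the thread level that MPI actually granted:

#include <mutex>
#include <mpi.h>

// Hypothetical helper: serialize MPI calls only when the runtime did
// not grant MPI_THREAD_MULTIPLE; otherwise the lock is skipped and
// threads may call MPI concurrently.
class mpi_guard {
public:
    explicit mpi_guard(std::mutex &m) : mtx_(m) {
        int level;
        MPI_Query_thread(&level);               // level granted at init
        locked_ = (level < MPI_THREAD_MULTIPLE);
        if (locked_) mtx_.lock();
    }
    ~mpi_guard() { if (locked_) mtx_.unlock(); }
    mpi_guard(const mpi_guard &) = delete;
    mpi_guard &operator=(const mpi_guard &) = delete;
private:
    std::mutex &mtx_;
    bool locked_;
};

In a real library the granted level would typically be queried once at startup and cached, rather than calling MPI_Query_thread on every guard construction.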

In the following simple example, is the lock required around the MPI_Test call?

#include <mutex>
#include <vector>
#include <thread>
#include <iostream>
#include <mpi.h>

static std::mutex mutex;
const static int numThreads = 4;
static int rank;
static int nprocs;

static void rthread(const int thrId) {
    int recv_buff[2];
    int send_buff[2];
    MPI_Request recv_request;

    {
        std::lock_guard<std::mutex> lck(mutex);     // <-- this lock is required
        MPI_Irecv(recv_buff, 2, MPI_INT, ((rank>0) ? rank-1 : nprocs-1), thrId, MPI_COMM_WORLD, &recv_request);
    }

    send_buff[0] = thrId;
    send_buff[1] = rank;
    {
        std::lock_guard<std::mutex> lck(mutex);     // <-- this lock is required
        MPI_Send(send_buff, 2, MPI_INT, ((rank+1<nprocs) ? rank+1 : 0), thrId, MPI_COMM_WORLD);
    }

    int flag = 0;
    while (!flag) {
        std::lock_guard<std::mutex> lck(mutex);    // <-- is this lock required?
        MPI_Test(&recv_request, &flag, MPI_STATUS_IGNORE);
        //...        do other stuff
    }

    std::cout << "[Rank " << rank << "][Thread " << thrId << "] Received a msg from thread " << recv_buff[0] << " from rank " << recv_buff[1] << std::endl;

}

int main(int argc, char **argv) {
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
    // (provided should be checked against MPI_THREAD_SERIALIZED here)
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    std::vector<std::thread> threads;
    for(int threadId = 0; threadId < numThreads; threadId++) {
        threads.push_back(std::thread(rthread, threadId));
    }
    for(int threadId = 0; threadId < numThreads; threadId++) {
        threads[threadId].join();
    }
    MPI_Finalize();
}

In my tests I executed some code without locks around the MPI_Test and MPI_Get_count calls; nothing bad happened and performance improved, but I do not know whether this is actually safe.

"I executed some code without locks in MPI_Test and MPI_Get_count calls, nothing bad happened" - Just because nothing bad happened in your tests does not mean that nothing bad will happen on other users' machines in the future. Race conditions are a bitch and can bite at any time. Also, future users may run your code on different hardware, and unless you strictly abide by the guarantees specified in the standard, you do not know what results you'll get. "Works on my machine" means nothing. – Jesper Juhl

1 Answer


The lock is required. The standard states it only briefly:

MPI_THREAD_SERIALIZED The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time: MPI calls are not made concurrently from two distinct threads

So there is no distinction between calls to different kinds of MPI functions. That covers MPI_Test, MPI_Wait, MPI_Probe and even local calls such as MPI_Get_count: they are all MPI calls, so under MPI_THREAD_SERIALIZED none of them may run concurrently with any other MPI call. Since you aim to write portable code (otherwise you could just assume an implementation supporting MPI_THREAD_MULTIPLE), you have to stick with the standard.
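Applied to the polling loop in the question (reusing its mutex and recv_request), one way to honor that requirement while keeping contention low is to scope the lock to the MPI_Test call alone:

int flag = 0;
while (!flag) {
    {
        // The lock is required here as well: MPI_Test is an MPI call.
        std::lock_guard<std::mutex> lck(mutex);
        MPI_Test(&recv_request, &flag, MPI_STATUS_IGNORE);
    }
    // ... do other stuff outside the lock, so other threads can make
    //     their own (serialized) MPI calls in the meantime
}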