I have a problem synchronizing values shared through an MPI shared memory window. The reason for using shared memory is that the data structure is too large to keep a copy of it on every process, but the calculation of its elements needs to be distributed. So the idea is to have only one copy of the data structure per node.
Below is a simplified version of the code, reduced to the minimal subset that should reproduce the problem. I skip the part where I synchronize between nodes.
I have two problems:
- Synchronization (a passive target lock/unlock epoch) is extremely slow.
- The results show inconsistencies inside the epochs (lock/unlock blocks), so there is apparently a race condition.
I've also tried active target synchronization (MPI_Win_fence()), but the same problems occur. Since I don't have much experience with this, it could be that I'm simply using the wrong approach.
MPI_Comm nodecomm;
int rank, nodesize, noderank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                    MPI_INFO_NULL, &nodecomm);
MPI_Comm_size(nodecomm, &nodesize);
MPI_Comm_rank(nodecomm, &noderank);

// only rank 0 on each node allocates the actual memory
int local_xx_size = 0;
if (noderank == 0) {
    local_xx_size = xx_size;
}

MPI_Win win_xx;
MPI_Aint winsize;
int windisp;
double *xx, *local_xx;
MPI_Win_allocate_shared(local_xx_size*sizeof(double), sizeof(double),
                        MPI_INFO_NULL, nodecomm, &local_xx, &win_xx);
xx = local_xx;
if (noderank != 0) {
    MPI_Win_shared_query(win_xx, 0, &winsize, &windisp, &xx);
}

// init xx
int i, j;
if (noderank == 0) {
    MPI_Win_lock_all(0, win_xx);
    for (i = 0; i < xx_size; i++) {
        xx[i] = 0.0;
    }
    MPI_Win_unlock_all(win_xx);
}
MPI_Barrier(nodecomm);

long counter = 0;
for (i = 0; i < largeNum; i++) {
    // some calculations
    for (j = 0; j < xx_size; j++) {
        // calculate res
        MPI_Win_lock_all(0, win_xx);
        xx[counter] += res; // update value
        MPI_Win_unlock_all(win_xx);
    }
}
MPI_Barrier(nodecomm);

// use xx (sync data from all the nodes)
MPI_Win_free(&win_xx);
I would appreciate any help or suggestions regarding these problems.