I have a problem synchronizing values shared through an MPI shared memory window. The reason for using shared memory is that the data structure is too large to keep a copy of it on every process, but the calculation of its elements needs to be distributed. So the idea is to have only one copy of the data structure per node.
Below is a simplified version of the code, reduced to the minimal subset that should reproduce the problem. I skip the part where I synchronize between nodes.
I have two problems:
- Synchronization (a passive target lock/unlock epoch) is extremely slow.
- The results show inconsistencies inside the epochs (lock/unlock blocks), so there is apparently a race condition.
I've also tried active target synchronization (MPI_Win_fence()), but the same problems occur. Since I don't have much experience with this, it could be that I'm simply using the wrong approach.
MPI_Comm nodecomm;
int rank, nodesize, noderank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                    MPI_INFO_NULL, &nodecomm);
MPI_Comm_size(nodecomm, &nodesize);
MPI_Comm_rank(nodecomm, &noderank);

// only rank 0 on each node allocates the actual memory
int local_xx_size = 0;
if (noderank == 0) {
    local_xx_size = xx_size;
}

MPI_Win win_xx;
MPI_Aint winsize;
int windisp;
double *xx, *local_xx;
MPI_Win_allocate_shared(local_xx_size*sizeof(double), sizeof(double),
                        MPI_INFO_NULL, nodecomm, &local_xx, &win_xx);
xx = local_xx;
if (noderank != 0) {
    MPI_Win_shared_query(win_xx, 0, &winsize, &windisp, &xx);
}

// init xx
int i, j;
if (noderank == 0) {
    MPI_Win_lock_all(0, win_xx);
    for (i = 0; i < xx_size; i++) {
        xx[i] = 0.0;
    }
    MPI_Win_unlock_all(win_xx);
}
MPI_Barrier(nodecomm);

long counter = 0;
for (i = 0; i < largeNum; i++) {
    // some calculations
    for (j = 0; j < xx_size; j++) {
        // calculate res
        MPI_Win_lock_all(0, win_xx);
        xx[counter] += res; // update value
        MPI_Win_unlock_all(win_xx);
    }
}
MPI_Barrier(nodecomm);

// use xx (sync data from all the nodes)
MPI_Win_free(&win_xx);
I would appreciate any help or suggestions regarding these problems.