3 votes

The MPI-3 standard introduces shared-memory windows, which can be read and written by all processes sharing the memory without calls to the MPI library. While there are examples of one-sided communication using shared or non-shared memory, I did not find much information about how to use shared memory correctly with direct access.

I ended up doing something like this, which works well, but I was wondering whether the MPI standard guarantees that it will always work.

// initialization:
int i_mpi;
MPI_Comm_rank(MPI_COMM_WORLD, &i_mpi);  // my rank, used as the key for the split
MPI_Comm comm_shared;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, i_mpi, MPI_INFO_NULL, &comm_shared);

// allocation
const int N_WIN = 10;           // I need several buffers.
const int mem_size = 1000*1000; // window size in bytes
double* mem[N_WIN];
MPI_Win win[N_WIN];
for (int i = 0; i < N_WIN; i++) {
    MPI_Win_allocate_shared(mem_size, sizeof(double), MPI_INFO_NULL, comm_shared, &mem[i], &win[i]);
    MPI_Win_lock_all(0, win[i]);  // open a passive-target access epoch on each window
}

while(1) {
    MPI_Barrier(comm_shared);
    ... // write anywhere on shared memory
    MPI_Barrier(comm_shared);
    ... // read on shared memory written by other processes
}

// deallocation
for (int i=0; i<N_WIN; i++) {
    MPI_Win_unlock_all(win[i]);
    MPI_Win_free(&win[i]);
}

Here, I ensure synchronization using MPI_Barrier() and assume the hardware keeps the memory view consistent. Furthermore, because I have several shared windows, a single call to MPI_Barrier() seems more efficient than calling MPI_Win_fence() on every shared-memory window.

It seems to work well on my x86 laptops and servers. But is this a valid/correct MPI program? Is there a more efficient method of achieving the same thing?

2 Answers

3 votes

There are two key issues here:

  1. MPI_Barrier is absolutely not a memory barrier and should never be used as one. It may synchronize memory as a side effect of its implementation in most cases, but users can never assume that. MPI_Barrier is only guaranteed to synchronize process execution. (If it helps, you can imagine a system where MPI_Barrier is implemented using a hardware widget that does no more than the MPI standard requires. IBM Blue Gene sort of did this in some cases.)
  2. This question is unanswerable without details on what you are actually doing with shared-memory here:
while(1) {
    MPI_Barrier(comm_shared);
    ... // write anywhere on shared memory
    MPI_Barrier(comm_shared);
    ... // read on shared memory written by other processes
}

It may not be written clearly, but the authors of the relevant text of the MPI-3 standard (I was part of this group) assumed that one could reason about shared memory using the memory model of the underlying/host language. Thus, if you are writing this code in C11, you can reason about it according to the C11 memory model.

If you want to use MPI to synchronize shared memory, then you should use MPI_Win_sync on all the windows for load-store accesses and MPI_Win_flush for RMA operations (Put/Get/Accumulate/Get_accumulate/Fetch_and_op/Compare_and_swap).
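For the RMA case, a minimal sketch under the passive-target epoch you already opened with MPI_Win_lock_all (the target rank and the value are made-up example values):

int target = 1;                  // hypothetical neighbor rank in comm_shared
double value = 42.0;
MPI_Put(&value, 1, MPI_DOUBLE, target, /*target_disp=*/0, 1, MPI_DOUBLE, win[0]);
MPI_Win_flush(target, win[0]);   // completes the Put at both origin and target
MPI_Barrier(comm_shared);        // after this, the target may safely load the value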

I expect MPI_Win_sync to be implemented as a CPU memory barrier, so it is redundant to call it for every window. This is why it may be more efficient to assume the C11 or C++11 memory model and use https://en.cppreference.com/w/c/atomic/atomic_thread_fence or https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence, respectively.
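To illustrate, here is a sketch of the C11 variant of your loop. This assumes your accesses are plain loads and stores to the shared allocation, and that the fences compile to the CPU barriers you need; strictly speaking, C11 fences synchronize only together with atomic operations, but MPI_Barrier supplies the cross-process synchronization here.

#include <stdatomic.h>

while (1) {
    MPI_Barrier(comm_shared);                   // process synchronization only
    ... // write anywhere on shared memory
    atomic_thread_fence(memory_order_release);  // publish this process's stores
    MPI_Barrier(comm_shared);
    atomic_thread_fence(memory_order_acquire);  // observe the other processes' stores
    ... // read on shared memory written by other processes
}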

1 vote

I would be tempted to say this MPI program is not valid.

Here is what I base my opinion on:

  • In the description of MPI_Win_allocate_shared:

    The consistency of load/store accesses from/to the shared memory as observed by the user program depends on the architecture. A consistent view can be created in the unified memory model (see Section 11.4) by utilizing the window synchronization functions (see Section 11.5) or explicitly completing outstanding store accesses (e.g., by calling MPI_WIN_FLUSH). MPI does not define semantics for accessing shared memory windows in the separate memory model.

  • Section 11.4, about the memory models, which states:

    In the RMA unified model, public and private copies are identical and updates via put or accumulate calls are eventually observed by load operations without additional RMA calls. A store access to a window is eventually visible to remote get or accumulate calls without additional RMA calls. These stronger semantics of the RMA unified model allow the user to omit some synchronization calls and potentially improve performance.

  • The advice to users that follows only indicates:

    If accesses in the RMA unified model are not synchronized (with locks or flushes, see Section 11.5.3), load and store operations might observe changes to the memory while they are in progress.

  • Section 11.7, on semantics and correctness, says:

    MPI_BARRIER provides process synchronization, but not memory synchronization.

  • The examples in Section 11.8 explain well how to use the flush and sync operations.

The only synchronization ever addressed is one-sided, i.e., in your case, MPI_Win_flush{,_all} or MPI_Win_unlock{,_all} (apart from the mutual exclusion between active and passive concurrent synchronization, which has to be enforced by the user, and the MPI_MODE_NOCHECK assert flag).

So either you access the memory directly with store operations, in which case you need to call MPI_Win_sync() on each of your windows before calling MPI_Barrier (as explained in Example 11.10 and sketched below) to ensure synchronization; or you perform RMA accesses, in which case you have to call at least MPI_Win_flush_all before the second barrier to ensure the operations have been propagated. If you then read with load operations, you may have to synchronize again after the second barrier before doing so.
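Concretely, for the load/store case, the loop from the question could become something like this (a sketch following the Example 11.10 pattern, reusing the variable names from the question):

while (1) {
    MPI_Barrier(comm_shared);
    ... // write anywhere on shared memory
    for (int i = 0; i < N_WIN; i++)
        MPI_Win_sync(win[i]);    // complete the local stores in each window
    MPI_Barrier(comm_shared);    // no process reads before all have written
    for (int i = 0; i < N_WIN; i++)
        MPI_Win_sync(win[i]);    // make the other processes' stores visible
    ... // read on shared memory written by other processes
}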

Another solution would be to unlock and re-lock the windows between the barriers. Alternatively, compiler- and hardware-specific constructs could ensure the load occurs after the data is updated.
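A rough sketch of the unlock/re-lock variant, using the window array from the question (an illustration, not a definitive recipe):

... // write anywhere on shared memory
for (int i = 0; i < N_WIN; i++)
    MPI_Win_unlock_all(win[i]);    // closing the epoch completes and synchronizes the accesses
MPI_Barrier(comm_shared);          // no process proceeds before all have unlocked
for (int i = 0; i < N_WIN; i++)
    MPI_Win_lock_all(0, win[i]);   // open a new access epoch for the next iteration
... // read on shared memory written by other processes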