You can share memory within a node using MPI_Win_allocate_shared
from MPI-3. It provides a portable way to use Sys5 and POSIX shared memory (and anything similar).
MPI functions
The following are taken from the MPI 3.1 standard.
Allocating shared memory
MPI_WIN_ALLOCATE_SHARED(size, disp_unit, info, comm, baseptr, win)
IN size size of local window in bytes (non-negative integer)
IN disp_unit local unit size for displacements, in bytes (positive integer)
IN info info argument (handle) IN comm intra-communicator (handle)
OUT baseptr address of local allocated window segment (choice)
OUT win window object returned by the call (handle)
int MPI_Win_allocate_shared(MPI_Aint size, int disp_unit, MPI_Info info, MPI_Comm comm, void *baseptr, MPI_Win *win)
(if you want the Fortran declaration, click the link)
You deallocate memory using MPI_Win_free
. Both allocation and deallocation are collective. This is unlike Sys5 or POSIX, but makes the interface much simpler on the user.
Querying the node allocations
In order to know how to perform load-store against another process' memory, you need to query the address of that memory in the local address space. Sharing the address in the other process' address space is incorrect (it might work in some cases, but one cannot assume it will work).
MPI_WIN_SHARED_QUERY(win, rank, size, disp_unit, baseptr)
IN win shared memory window object (handle)
IN rank rank in the group of window win (non-negative integer) or MPI_PROC_NULL
OUT size size of the window segment (non-negative integer)
OUT disp_unit local unit size for displacements, in bytes (positive integer)
OUT baseptr address for load/store access to window segment (choice)
int MPI_Win_shared_query(MPI_Win win, int rank, MPI_Aint *size, int *disp_unit, void *baseptr)
(if you want the Fortran declaration, click the link above)
Synchronizing shared memory
MPI_WIN_SYNC(win)
IN win window object (handle)
int MPI_Win_sync(MPI_Win win)
This function serves as a memory barrier for load-store accesses to the data associated with the shared memory window.
You can also use ISO language features (i.e. those provided by C11 and C++11 atomics) or compiler extensions (e.g. GCC intrinsics such as __sync_synchronize
) to attain a consistent view of data.
Synchronization
If you understand interprocess shares memory semantics already, the MPI-3 implementation will be easy to understand. If not, just remember that you need to synchronize memory and control flow correctly. There is MPI_Win_sync for the former, while existing MPI sync functions like MPI_Barrier
and MPI_Send
+MPI_Recv
will work for the latter. Or you can use MPI-3 atomics to build counters and locks.
Example program
The following code is from https://github.com/jeffhammond/HPCInfo/tree/master/mpi/rma/shared-memory-windows, which contains example programs of shared-memory usage that have been used by the MPI Forum to debate the semantics of these features.
This program demonstrates unidirectional pair-wise synchronization through shared-memory. If you merely want to create a WORM (write-once, read-many) slab, that should be much simpler.
#include <stdio.h>
#include <mpi.h>
/* This function synchronizes process rank i with process rank j
* in such a way that this function returns on process rank j
* only after it has been called on process rank i.
*
* No additional semantic guarantees are provided.
*
* The process ranks are with respect to the input communicator (comm). */
int p2p_xsync(int i, int j, MPI_Comm comm)
{
/* Avoid deadlock. */
if (i==j) {
return MPI_SUCCESS;
}
int rank;
MPI_Comm_rank(comm, &rank);
int tag = 666; /* The number of the beast. */
if (rank==i) {
MPI_Send(NULL, 0, MPI_INT, j, tag, comm);
} else if (rank==j) {
MPI_Recv(NULL, 0, MPI_INT, i, tag, comm, MPI_STATUS_IGNORE);
}
return MPI_SUCCESS;
}
/* If val is the same at all MPI processes in comm,
* this function returns 1, else 0. */
int coll_check_equal(int val, MPI_Comm comm)
{
int minmax[2] = {-val,val};
MPI_Allreduce(MPI_IN_PLACE, minmax, 2, MPI_INT, MPI_MAX, comm);
return ((-minmax[0])==minmax[1] ? 1 : 0);
}
int main(int argc, char * argv[])
{
MPI_Init(&argc, &argv);
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int * shptr = NULL;
MPI_Win shwin;
MPI_Win_allocate_shared(rank==0 ? sizeof(int) : 0,sizeof(int),
MPI_INFO_NULL, MPI_COMM_WORLD,
&shptr, &shwin);
/* l=local r=remote */
MPI_Aint rsize = 0;
int rdisp;
int * rptr = NULL;
int lint = -999;
MPI_Win_shared_query(shwin, 0, &rsize, &rdisp, &rptr);
if (rptr==NULL || rsize!=sizeof(int)) {
printf("rptr=%p rsize=%zu \n", rptr, (size_t)rsize);
MPI_Abort(MPI_COMM_WORLD, 1);
}
/*******************************************************/
MPI_Win_lock_all(0 /* assertion */, shwin);
if (rank==0) {
*shptr = 42; /* Answer to the Ultimate Question of Life, The Universe, and Everything. */
MPI_Win_sync(shwin);
}
for (int j=1; j<size; j++) {
p2p_xsync(0, j, MPI_COMM_WORLD);
}
if (rank!=0) {
MPI_Win_sync(shwin);
}
lint = *rptr;
MPI_Win_unlock_all(shwin);
/*******************************************************/
if (1==coll_check_equal(lint,MPI_COMM_WORLD)) {
if (rank==0) {
printf("SUCCESS!\n");
}
} else {
printf("rank %d: lint = %d \n", rank, lint);
}
MPI_Win_free(&shwin);
MPI_Finalize();
return 0;
}
x_
andx
(this isn't too clear in the question)? Isx_
actuallyx
at next iteration? – Gillesx_
isx
at the next iteration. Each iteration takesx
and calculatesx_
, which will be thex
for the next iteration. I will edit the question to make it more clear. – andrepdMPI_Algather
method you described, and OpenMP for intra-node parallelism, withx
andx_
shared amongst threads. – Gillesshm_open
for example) with the high risk of facing problems. You can also play with one-sided communications,MPI_Put()
and co., but again, that won't be very satisfactory. OpenMP looks much more appealing FMPOV. – Gilles