Share memory across MPI nodes to prevent unnecessary copying

Question

I have an algorithm where, in each iteration, each node has to calculate a segment of an array x_, where each element of x_ depends on all the elements of x.

x_[i] = some_func(x) // each x_[i] depends on the entire x

That is, each iteration takes x and calculates x_, which will be the new x for the next iteration.

A way of parallelizing this in MPI would be to split x_ between the nodes and have an Allgather call after the calculation of x_, so that each processor sends its segment of x_ to the appropriate location in x on all the other processors, then repeat. This is very inefficient, since it requires an expensive Allgather call every iteration, not to mention that it requires as many copies of x as there are nodes.

I've thought of an alternative way that doesn't require copying. If the program is running on a single machine, with shared RAM, would it be possible to just share the x_ between the nodes (without copying)? That is, after calculating x_ each processor would make it visible to the other nodes, which could then use it as their x for the next iteration without needing to make several copies. I can design the algorithm so that no processor accesses the same x_ at the same time, which is why making a private copy for each node is overkill.

I guess what I'm asking is: can I share memory in MPI simply by tagging an array as shared-between-nodes, as opposed to manually making a copy for each node? (for simplicity assume I'm running on one CPU)

What is the relation between x_ and x (this isn't too clear in the question)? Is x_ actually x at next iteration? — Gilles
@Gilles Yes, x_ is x at the next iteration. Each iteration takes x and calculates x_, which will be the x for the next iteration. I will edit the question to make it more clear. — andrepd
What about MPI+OpenMP? MPI for inter-node parallelism (if/when needed) with the MPI_Allgather method you described, and OpenMP for intra-node parallelism, with x and x_ shared amongst threads. — Gilles
And BTW, no, MPI doesn't permit by itself to share a memory segment across processes on the same node. If you want to do that, you have to manage it by hand (with shm_open for example) with the high risk of facing problems. You can also play with one-sided communications, MPI_Put() and co., but again, that won't be very satisfactory. OpenMP looks much more appealing FMPOV. — Gilles

Jeff Hammond · Accepted Answer · 2016-01-17T18:09:01

You can share memory within a node using MPI_Win_allocate_shared from MPI-3. It provides a portable way to use Sys5 and POSIX shared memory (and anything similar).

MPI functions

The following are taken from the MPI 3.1 standard.

Allocating shared memory

MPI_WIN_ALLOCATE_SHARED(size, disp_unit, info, comm, baseptr, win)
IN size size of local window in bytes (non-negative integer)
IN disp_unit local unit size for displacements, in bytes (positive integer)
IN info info argument (handle)
IN comm intra-communicator (handle)
OUT baseptr address of local allocated window segment (choice)
OUT win window object returned by the call (handle)

int MPI_Win_allocate_shared(MPI_Aint size, int disp_unit, MPI_Info info, MPI_Comm comm, void *baseptr, MPI_Win *win)

(if you want the Fortran declaration, click the link)

You deallocate memory using MPI_Win_free. Both allocation and deallocation are collective. This is unlike Sys5 or POSIX, but makes the interface much simpler for the user.

Querying the node allocations

In order to know how to perform load-store against another process' memory, you need to query the address of that memory in the local address space. Sharing the address in the other process' address space is incorrect (it might work in some cases, but one cannot assume it will work).

MPI_WIN_SHARED_QUERY(win, rank, size, disp_unit, baseptr)
IN win shared memory window object (handle)
IN rank rank in the group of window win (non-negative integer) or MPI_PROC_NULL
OUT size size of the window segment (non-negative integer)
OUT disp_unit local unit size for displacements, in bytes (positive integer)
OUT baseptr address for load/store access to window segment (choice)

int MPI_Win_shared_query(MPI_Win win, int rank, MPI_Aint *size, int *disp_unit, void *baseptr)

(if you want the Fortran declaration, click the link above)

Synchronizing shared memory

MPI_WIN_SYNC(win)
IN win window object (handle)

int MPI_Win_sync(MPI_Win win)

This function serves as a memory barrier for load-store accesses to the data associated with the shared memory window.

You can also use ISO language features (i.e. those provided by C11 and C++11 atomics) or compiler extensions (e.g. GCC intrinsics such as __sync_synchronize) to attain a consistent view of data.
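To make the C11 option concrete, here is a small sketch of a release/acquire handoff. The slot_t layout and both function names are hypothetical. The sketch also assumes that atomic_int is lock-free on the target platform, since the C standard only recommends, rather than requires, that lock-free atomics work across processes mapping the same memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical layout for a value handed over through shared memory. */
typedef struct {
    int        payload; /* plain data, written before the flag is raised */
    atomic_int ready;   /* 0 = not yet published, 1 = published */
} slot_t;

/* Writer side: store the payload, then raise the flag with release ordering
 * so that the payload store cannot be reordered after the flag store. */
void slot_publish(slot_t *s, int value)
{
    assert(s != NULL);
    s->payload = value;
    atomic_store_explicit(&s->ready, 1, memory_order_release);
}

/* Reader side: spin until the flag is observed with acquire ordering. Once it
 * is, the payload written before the release store is visible as well. */
int slot_consume(slot_t *s)
{
    assert(s != NULL);
    while (atomic_load_explicit(&s->ready, memory_order_acquire) == 0) {
        /* busy-wait; a real program might back off or yield here */
    }
    return s->payload;
}
```

In an MPI setting, the slot would live inside the shared window, with one rank calling slot_publish and another calling slot_consume on its own locally valid pointer to the same slot.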

Synchronization

If you already understand interprocess shared-memory semantics, the MPI-3 implementation will be easy to understand. If not, just remember that you need to synchronize both memory and control flow correctly. There is MPI_Win_sync for the former, while existing MPI sync functions like MPI_Barrier and MPI_Send+MPI_Recv will work for the latter. Or you can use MPI-3 atomics to build counters and locks.

Example program

The following code is from https://github.com/jeffhammond/HPCInfo/tree/master/mpi/rma/shared-memory-windows, which contains example programs of shared-memory usage that have been used by the MPI Forum to debate the semantics of these features.

This program demonstrates unidirectional pair-wise synchronization through shared-memory. If you merely want to create a WORM (write-once, read-many) slab, that should be much simpler.

#include <stdio.h>
#include <mpi.h>

/* This function synchronizes process rank i with process rank j
 * in such a way that this function returns on process rank j
 * only after it has been called on process rank i.
 *
 * No additional semantic guarantees are provided.
 *
 * The process ranks are with respect to the input communicator (comm). */

int p2p_xsync(int i, int j, MPI_Comm comm)
{
    /* Avoid deadlock. */
    if (i==j) {
        return MPI_SUCCESS;
    }

    int rank;
    MPI_Comm_rank(comm, &rank);

    int tag = 666; /* The number of the beast. */

    if (rank==i) {
        MPI_Send(NULL, 0, MPI_INT, j, tag, comm);
    } else if (rank==j) {
        MPI_Recv(NULL, 0, MPI_INT, i, tag, comm, MPI_STATUS_IGNORE);
    }

    return MPI_SUCCESS;
}

/* If val is the same at all MPI processes in comm,
 * this function returns 1, else 0. */

int coll_check_equal(int val, MPI_Comm comm)
{
    int minmax[2] = {-val,val};
    MPI_Allreduce(MPI_IN_PLACE, minmax, 2, MPI_INT, MPI_MAX, comm);
    return ((-minmax[0])==minmax[1] ? 1 : 0);
}

int main(int argc, char * argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *   shptr = NULL;
    MPI_Win shwin;
    MPI_Win_allocate_shared(rank==0 ? sizeof(int) : 0, sizeof(int),
                            MPI_INFO_NULL, MPI_COMM_WORLD,
                            &shptr, &shwin);

    /* l=local r=remote */
    MPI_Aint rsize = 0;
    int rdisp;
    int * rptr = NULL;
    int lint = -999;
    MPI_Win_shared_query(shwin, 0, &rsize, &rdisp, &rptr);
    if (rptr==NULL || rsize!=sizeof(int)) {
        printf("rptr=%p rsize=%zu \n", (void *)rptr, (size_t)rsize);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /*******************************************************/

    MPI_Win_lock_all(0 /* assertion */, shwin);

    if (rank==0) {
        *shptr = 42; /* Answer to the Ultimate Question of Life, The Universe, and Everything. */
        MPI_Win_sync(shwin);
    }
    for (int j=1; j<size; j++) {
        p2p_xsync(0, j, MPI_COMM_WORLD);
    }
    if (rank!=0) {
        MPI_Win_sync(shwin);
    }
    lint = *rptr;

    MPI_Win_unlock_all(shwin);

    /*******************************************************/

    if (1==coll_check_equal(lint,MPI_COMM_WORLD)) {
        if (rank==0) {
            printf("SUCCESS!\n");
        }
    } else {
        printf("rank %d: lint = %d \n", rank, lint);
    }

    MPI_Win_free(&shwin);

    MPI_Finalize();

    return 0;
}