2
votes

I have a simple C++ struct that basically wraps a standard C array:

struct MyArray {
    T* data;
    int length;
    // ...
}

where T is a numeric type like float or double. length is the number of elements in the array. Typically my arrays are very large (tens of thousands up to tens of millions of elements).

I have an MPI program where I would like to expose two instances of MyArray, say a_old and a_new, as shared memory objects via MPI 3 shared memory. The context is that each MPI rank reads from a_old. Then, each MPI rank writes to certain indices of a_new (each rank only writes to its own set of indices - no overlap). Finally, a_old = a_new must be set on all ranks. a_old and a_new are the same size. Right now I'm making my code work by syncing (Isend/Irecv) each rank's updated values with other ranks. However, due to the data access pattern, there's no reason I need to incur the overhead of message passing and could instead have one shared memory object and just put a barrier before a_old = a_new. I think this would give me better performance (though please correct me if I'm wrong).

I have had trouble finding complete code examples of doing shared memory with MPI 3. Most sites only provide reference documentation or incomplete snippets. Could someone walk me through a simple and complete code example that does the sort of thing I'm trying to achieve (updating and syncing a numeric array via MPI shared memory)? I understand the main concepts of creating shared memory communicators and windows, setting fences, etc., but it would really help my understanding to see one example that puts it all together.

Also, I should mention that I'll only be running my code on one node, so I don't need to worry about needing multiple copies of my shared-memory object across nodes; I just need one copy of my data for the single node on which my MPI processes are running. Despite this, other solutions like OpenMP aren't feasible for me in this case, since I have a ton of MPI code and can't rewrite everything for the sake of one or two arrays I'd like to share.

2

2 Answers

7
votes

Using shared memory with MPI-3 is relatively simple.

First, you allocate the shared memory window using MPI_Win_allocate_shared:

MPI_Win win;
MPI_Aint size;
void *baseptr;

if (rank == 0)
{
   size = 2 * ARRAY_LEN * sizeof(T);
   MPI_Win_allocate_shared(size, sizeof(T), MPI_INFO_NULL,
                           MPI_COMM_WORLD, &baseptr, &win);
}
else
{
   int disp_unit;
   MPI_Win_allocate_shared(0, sizeof(T), MPI_INFO_NULL,
                           MPI_COMM_WORLD, &baseptr, &win);
   MPI_Win_shared_query(win, 0, &size, &disp_unit, &baseptr);
}
a_old.data = baseptr;
a_old.length = ARRAY_LEN;
a_new.data = a_old.data + ARRAY_LEN;
a_new.length = ARRAY_LEN;

Here, only rank 0 allocates memory. It doesn't really matter which process allocates it as it is shared. It is even possible to have each process allocate a portion of the memory, but since by the default the allocation is contiguous, both methods are equivalent. MPI_Win_shared_query is then used by all other processes to find out the location in their virtual address space of the beginning of the shared memory block. That address might vary among the ranks and therefore one should not pass around absolute pointers.

You can now simply load from and store into a_old.data respectively a_new.data. As the ranks in your case work on disjoint sets of memory locations, you don't really need to lock the window. Use window locks to implement e.g. protected initialisation of a_old or other operations that require synchronisation. You might also need to explicitly tell the compiler not to reorder the code and to emit a memory fence in order to have all outstanding load/store operations finished before e.g. you call MPI_Barrier().

The a_old = a_new code suggests copying one array onto the other. Instead, you could simply swap the data pointers and eventually the size fields. Since only the data of the array is in the shared memory block, swapping the pointers is a local operation, i.e. no synchronisation needed. Assuming that both arrays are of equal length:

T *temp;
temp = a_old.data;
a_old.data = a_new.data;
a_new.data = temp;

You still need a barrier to make sure that all other processes have finished processing before continuing further.

At the very end, simply free the window:

MPI_Win_free(&win);

A complete example (in C) follows:

#include <stdio.h>
#include <mpi.h>

#define ARRAY_LEN 1000

int main (void)
{
   MPI_Init(NULL, NULL);

   int rank, nproc;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &nproc);

   MPI_Win win;
   MPI_Aint size;
   void *baseptr;

   if (rank == 0)
   {
      size = ARRAY_LEN * sizeof(float);
      MPI_Win_allocate_shared(size, sizeof(int), MPI_INFO_NULL,
                              MPI_COMM_WORLD, &baseptr, &win);
   }
   else
   {
      int disp_unit;
      MPI_Win_allocate_shared(0, sizeof(int), MPI_INFO_NULL,
                              MPI_COMM_WORLD, &baseptr, &win);
      MPI_Win_shared_query(win, 0, &size, &disp_unit, &baseptr);
   }

   printf("Rank %d, baseptr = %p\n", rank, baseptr);

   int *arr = baseptr;
   for (int i = rank; i < ARRAY_LEN; i += nproc)
     arr[i] = rank;

   MPI_Barrier(MPI_COMM_WORLD);

   if (rank == 0)
   {
      for (int i = 0; i < 10; i++)
         printf("%4d", arr[i]);
      printf("\n");
   }

   MPI_Win_free(&win);

   MPI_Finalize();
   return 0;
}

Disclaimer: Take this with a grain of salt. My understanding of MPI's RMA is still quite weak.

3
votes

Here is a code that feeds your description. In comments I put little descriptions about the code. Generally its presenting a dynamic RMA Window and the memory has to be allocated and at to the window.

MPI_Win_lock_all(0, win) Description from Open MPI Documentation:

Starts an RMA access epoch to all processes in win, with a lock type of MPI_LOCK_SHARED. During the epoch, the calling process can access the window memory on all processes in win by using RMA operations.

Where I have used MPI_INFO_NULL you can use an MPI_Info object to provide additional information to MPI but it is depends on your memory access pattern.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

typedef struct MyArray {
    double* data;
    int length;
}MyArray;

#define ARRAY_SIZE 10

int main(int argc, char *argv[]) {
    int rank, worldSize, i;
    MPI_Win win;
    MPI_Aint disp;
    MPI_Aint *allProcessDisp;
    MPI_Request *requestArray;

    MyArray myArray;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &worldSize);

    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    allProcessDisp = malloc(sizeof(MPI_Aint) * worldSize);

    requestArray = malloc(sizeof(MPI_Request) * worldSize);
    for (i = 0; i < worldSize; i++) 
        requestArray[i] = MPI_REQUEST_NULL;

    myArray.data = malloc(sizeof(double) * ARRAY_SIZE);
    myArray.length = ARRAY_SIZE;

    //Allocating memory for each process share window space 
    MPI_Alloc_mem(sizeof(double) * ARRAY_SIZE, MPI_INFO_NULL, &myArray.data);
    for (i = 0; i < ARRAY_SIZE; i++)
        myArray.data[i] = rank;

    //attach the allocating memory to each process share window space 
    MPI_Win_attach(win, myArray.data, sizeof(double) * ARRAY_SIZE);

    MPI_Get_address(myArray.data, &disp);

    if (rank == 0) {
        allProcessDisp[0] = disp;
        //Collect all displacements
        for (i = 1; i < worldSize; i++) {
            MPI_Irecv(&allProcessDisp[i], 1, MPI_AINT, i, 0, MPI_COMM_WORLD, &requestArray[i]);
        }
        MPI_Waitall(worldSize, requestArray, MPI_STATUS_IGNORE);
        MPI_Bcast(allProcessDisp, worldSize, MPI_AINT, 0, MPI_COMM_WORLD);
    }
    else {
        //send displacement 
        MPI_Send(&disp, 1, MPI_AINT, 0, 0, MPI_COMM_WORLD);
        MPI_Bcast(allProcessDisp, worldSize, MPI_AINT, 0, MPI_COMM_WORLD);
    }

    // here you can do RMA operations 
    // Each time you need an RMA operation you start with 
    double otherRankData = -1.0;
    int otherRank = 1;
    if (rank == 0) {
        MPI_Win_lock_all(0, win);
        MPI_Get(&otherRankData, 1, MPI_DOUBLE, otherRank, allProcessDisp[otherRank], 1, MPI_DOUBLE, win);
        // and end with 
        MPI_Win_unlock_all(win);
        printf("Rank 0 : Got %.2f from %d\n", otherRankData, otherRank);
    }

    if (rank == 1) {
        MPI_Win_lock_all(0, win);
        MPI_Put(myArray.data, ARRAY_SIZE, MPI_DOUBLE, 0, allProcessDisp[0], ARRAY_SIZE, MPI_DOUBLE, win);
        // and end with 
        MPI_Win_unlock_all(win);
    }

    printf("Rank %d: ", rank);
    for (i = 0; i < ARRAY_SIZE; i++)
        printf("%.2f ", myArray.data[i]);
    printf("\n");

    //set rank 0 array
    if (rank == 0) {
        for (i = 0; i < ARRAY_SIZE; i++)
            myArray.data[i] = -1.0;

        printf("Rank %d: ", rank);
        for (i = 0; i < ARRAY_SIZE; i++)
            printf("%.2f ", myArray.data[i]);
        printf("\n");
    }

    free(allProcessDisp);
    free(requestArray);
    free(myArray.data);

    MPI_Win_detach(win, myArray.data);
    MPI_Win_free(&win);
    MPI_Finalize();

    return 0;
}