
I am writing an optimization program using MPI-2, in which I need a std::vector of equal-length std::vectors (conceptually), shared among all processes. This vector holds the k best solutions found so far, and is updated whenever one of the many MPI processes finds a new best solution. The time each process spends finding a new solution usually varies a lot.

My question is, considering the performance cost of synchronization and waiting, whether I should use MPI collectives such as MPI_Allgather each time a new best solution is found, or whether I should use one-sided communication in MPI-2 to maintain a "shared" vector among all processes.

In particular, if I use MPI_Allgather, will processes that finish their work early sit idle, waiting for some kind of synchronization with the other processes?
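
For concreteness, the collective variant I have in mind would look roughly like the sketch below (Solution, SOL_LEN, K, by_cost and the merge step are all made up for illustration, not my actual code). As far as I understand, every rank has to enter the MPI_Allgather, which is exactly where I expect the fast processes to block:

#include <mpi.h>
#include <vector>
#include <algorithm>

// Placeholder fixed-size solution record, so MPI can ship it as plain doubles.
const int SOL_LEN = 8;   // length of one solution vector (made up)
const int K       = 10;  // how many best solutions to keep (made up)

struct Solution {
    double cost;
    double x[SOL_LEN];
};

static bool by_cost(const Solution &a, const Solution &b) { return a.cost < b.cost; }

// Every rank contributes its current best; every rank then merges the gathered
// candidates into its local copy of the k best solutions.
void exchange_best(std::vector<Solution> &best_k, Solution my_best, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    // All ranks must reach this call before anyone can proceed.
    std::vector<Solution> all(size);
    MPI_Allgather(&my_best, sizeof(Solution) / sizeof(double), MPI_DOUBLE,
                  all.data(), sizeof(Solution) / sizeof(double), MPI_DOUBLE, comm);

    // Merge and keep the K cheapest solutions.
    best_k.insert(best_k.end(), all.begin(), all.end());
    std::sort(best_k.begin(), best_k.end(), by_cost);
    if ((int)best_k.size() > K) best_k.resize(K);
}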

I have some working experience with MPI point-to-point communication (update: as well as UPC), but haven't used collectives or one-sided communication in actual coding. I searched SO and found relevant questions/answers about MPI_Allgather, e.g. Distribute a structure using MPI_Allgather, and about one-sided communication, e.g. Creating a counter that stays synchronized across MPI processes. But I am having trouble telling the exact difference between the two approaches.

Thanks,

--- Update ---

In particular, I have the code example at the bottom, taken from Creating a counter that stays synchronized across MPI processes, which uses one-sided communication to maintain a single "shared" int. I tried to adapt it to work for a generic type, but I can't get it to work: I have trouble understanding the original code (why does it maintain an array data?), and I don't see how to generalize MPI_Accumulate to a user-defined operation (such as simply replacing the old vector with a new one).

template <typename T> // note: T can only be primitive types (not pointer, ref or struct), such as int and double
struct mpi_array {
    typedef std::vector<T> Vector;
    MPI_Win win;
    int hostrank;
    int rank;
    int size;
    Vector val;
    Vector *hostvals;
};
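
For what it's worth, this is roughly how I imagine the window creation would generalize (untested sketch; I replaced the std::vector members with a raw buffer, since MPI_Win_create wants contiguous memory from MPI_Alloc_mem). The part I cannot figure out is what to put in place of the MPI_Accumulate call:

#include <mpi.h>

// Untested sketch: the host rank exposes one block of `nelems` elements of T
// per rank; everyone else joins the window with a zero-size buffer.
template <typename T>
struct mpi_array_t {
    MPI_Win win;
    int hostrank;   // rank that owns the exposed memory
    int rank, size;
    int nelems;     // elements per rank
    T *data;        // host rank: size*nelems elements; others: NULL
};

template <typename T>
mpi_array_t<T> *create_array(int hostrank, int nelems) {
    mpi_array_t<T> *arr = new mpi_array_t<T>();
    arr->hostrank = hostrank;
    arr->nelems = nelems;
    MPI_Comm_rank(MPI_COMM_WORLD, &arr->rank);
    MPI_Comm_size(MPI_COMM_WORLD, &arr->size);

    if (arr->rank == hostrank) {
        MPI_Alloc_mem(arr->size * nelems * sizeof(T), MPI_INFO_NULL, &arr->data);
        for (int i = 0; i < arr->size * nelems; i++) arr->data[i] = T();
        MPI_Win_create(arr->data, arr->size * nelems * sizeof(T), sizeof(T),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &arr->win);
    } else {
        arr->data = NULL;
        MPI_Win_create(arr->data, 0, sizeof(T), MPI_INFO_NULL, MPI_COMM_WORLD,
                       &arr->win);
    }
    return arr;
}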


The one-sided communication counter code:

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

struct mpi_counter_t {
    MPI_Win win;
    int  hostrank ;
    int  myval;
    int *data;
    int rank, size;
};

struct mpi_counter_t *create_counter(int hostrank) {
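    /* The rank given by `hostrank` allocates one int slot per rank and exposes it
       through an MPI window; every other rank joins the window creation with a
       zero-size buffer, so all remote accesses target the host rank's memory. */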
    struct mpi_counter_t *count;

    count = (struct mpi_counter_t *)malloc(sizeof(struct mpi_counter_t));
    count->hostrank = hostrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &(count->rank));
    MPI_Comm_size(MPI_COMM_WORLD, &(count->size));

    if (count->rank == hostrank) {
        MPI_Alloc_mem(count->size * sizeof(int), MPI_INFO_NULL, &(count->data));
        for (int i=0; i<count->size; i++) count->data[i] = 0;
        MPI_Win_create(count->data, count->size * sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &(count->win));
    } else {
        count->data = NULL;
        MPI_Win_create(count->data, 0, 1,
                       MPI_INFO_NULL, MPI_COMM_WORLD, &(count->win));
    }
    count -> myval = 0;

    return count;
}

int increment_counter(struct mpi_counter_t *count, int increment) {
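    /* Add `increment` to the calling rank's slot on the host rank's window and
       read the other ranks' slots within the same exclusive-lock epoch; the
       return value is the sum of all slots, i.e. the counter after our update. */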
    int *vals = (int *)malloc( count->size * sizeof(int) );
    int val;

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, count->hostrank, 0, count->win);

    for (int i=0; i<count->size; i++) {
        if (i == count->rank) {
            /* atomically add our increment to our own slot on the host rank */
            MPI_Accumulate(&increment, 1, MPI_INT, count->hostrank, i, 1, MPI_INT,
                           MPI_SUM, count->win);
        } else {
            /* read every other rank's current slot from the host rank */
            MPI_Get(&vals[i], 1, MPI_INT, count->hostrank, i, 1, MPI_INT, count->win);
        }
    }

    MPI_Win_unlock(count->hostrank, count->win);
    count->myval += increment;

    vals[count->rank] = count->myval;
    val = 0;
    for (int i=0; i<count->size; i++)
        val += vals[i];

    free(vals);
    return val;
}

void delete_counter(struct mpi_counter_t **count) {
    if ((*count)->rank == (*count)->hostrank) {
        MPI_Free_mem((*count)->data);
    }
    MPI_Win_free(&((*count)->win));
    free((*count));
    *count = NULL;

    return;
}

void print_counter(struct mpi_counter_t *count) {
    if (count->rank == count->hostrank) {
        for (int i=0; i<count->size; i++) {
            printf("%2d ", count->data[i]);
        }
        puts("");
    }
}

int test1() {
    struct mpi_counter_t *c;
    int rank;
    int result;

    c = create_counter(0);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    result = increment_counter(c, 1);
    printf("%d got counter %d\n", rank, result);

    MPI_Barrier(MPI_COMM_WORLD);
    print_counter(c);
    delete_counter(&c);
}


int test2() {
    const int WORKITEMS=50;

    struct mpi_counter_t *c;
    int rank;
    int result = 0;

    c = create_counter(0);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    srandom(rank);

    while (result < WORKITEMS) {
        result = increment_counter(c, 1);
        if (result <= WORKITEMS) {
             printf("%d working on item %d...\n", rank, result);
             sleep(random() % 10);
         } else {
             printf("%d done\n", rank);
         }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    print_counter(c);
    delete_counter(&c);
}

int main(int argc, char **argv) {

    MPI_Init(&argc, &argv);

    test1();
    test2();

    MPI_Finalize();
}
You cannot generalize MPI_Accumulate to arbitrary (i.e., non-built-in) types, because that is not supported by MPI-3 (what you are asking for often goes by the name active messages, for which you might try GASNet). – Jeff Hammond

1 Answer


Your concern that some processes might enter the MPI_ALLGATHER before others is valid, but that's always the case in any application with synchronization, not just those that explicitly use collective communication.

However, it appears you might have a misunderstanding about what the one-sided operations do. They don't give you a Partitioned Global Address Space (PGAS) model where everything is synchronized for you. Instead, they just give you a way to address the memory of remote processes directly; each process's memory is still separate. Also, if you're going to upgrade from point-to-point to the rest of MPI, I wouldn't limit yourself to just the MPI-2 functions. There is new stuff in MPI-3 that improves both collectives and one-sided communication (especially the latter).
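
To make that concrete, here is a bare-bones sketch (not your vector-of-vectors, just a fixed-size buffer of doubles hosted on rank 0) of the passive-target lock/put/get pattern that the counter code above is built on; nothing is kept consistent for you outside the lock/unlock epochs:

#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    const int N = 16;              // made-up buffer length
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Rank 0 exposes N doubles; everyone else exposes nothing.
    double *base = NULL;
    MPI_Win win;
    if (rank == 0) {
        MPI_Alloc_mem(N * sizeof(double), MPI_INFO_NULL, &base);
        for (int i = 0; i < N; i++) base[i] = 0.0;
        MPI_Win_create(base, N * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    } else {
        MPI_Win_create(NULL, 0, sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    }

    // Any rank can overwrite the buffer on rank 0 under an exclusive lock...
    std::vector<double> mine(N, (double)rank);
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
    MPI_Put(&mine[0], N, MPI_DOUBLE, 0, 0, N, MPI_DOUBLE, win);
    MPI_Win_unlock(0, win);

    // ...or read it back later; no other rank's memory is touched.
    std::vector<double> copy(N);
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    MPI_Get(&copy[0], N, MPI_DOUBLE, 0, 0, N, MPI_DOUBLE, win);
    MPI_Win_unlock(0, win);

    MPI_Win_free(&win);
    if (rank == 0) MPI_Free_mem(base);
    MPI_Finalize();
    return 0;
}

The counter code adds one more ingredient on top of this, MPI_Accumulate, which is applied atomically at the target but is limited to the predefined reduction operations (as the comment above points out).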

All that being said, if you've never used anything but point-to-point, one-sided is going to be a big jump for you. You might want to take an intermediate step and check out collectives first anyway. If you're still not happy with your performance, you can take a look at the one-sided chapter, but it's very complex, and most people end up using something that sits on top of one-sided rather than using it directly (like some of the PGAS languages, perhaps).