1 vote

I am looking for an MPI function/method that delivers multiple data blocks from one process to all others, similar to MPI_Bcast but with multiple blocks at the same time.

I have a fragmented data block on the root rank:

#define BLOCKS 5
#define BLOCKSIZE 10000

char *datablock[BLOCKS];
int i;
for (i=0; i<BLOCKS; i++) datablock[i] = (char*)malloc(BLOCKSIZE*sizeof(char));

This is just an example, but it should be clear that the blocks are not necessarily adjacent in memory. I want this data delivered to all other ranks (where I have already prepared the necessary memory to store it).

I noticed that there are routines like MPI_Gatherv or MPI_Scatterv which gather or scatter fragmented data using a displacement array. The problem is that scatter sends each fragment to a different rank; I need to send all fragments to all other ranks, something like an MPI_Bcast with displacement information, say an MPI_Bcastv.
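For reference, this is roughly the displacement-based interface I mean (the MPI_Scatterv prototype):

/* sendcounts[] and displs[] describe one fragment per destination rank,
   which is exactly why it does not fit my case: every rank should receive
   all fragments, not just one. */
int MPI_Scatterv(const void *sendbuf, const int sendcounts[], const int displs[],
                 MPI_Datatype sendtype, void *recvbuf, int recvcount,
                 MPI_Datatype recvtype, int root, MPI_COMM comm);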

One solution would be multiple MPI_Bcast calls (one for each block), but I am not sure if this is the best way to do it.
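Just to illustrate that idea, a minimal (untested) sketch reusing the declarations from above:

/* One broadcast per block; every rank must issue the calls in the same order.
   On non-root ranks, datablock[i] points at the pre-allocated receive memory. */
for (i = 0; i < BLOCKS; i++)
    MPI_Bcast(datablock[i], BLOCKSIZE, MPI_CHAR, 0, MPI_COMM_WORLD);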

UPDATE: I will try the MPI_Ibcast approach; here is what I think should work:

int i;
int rank; // rank id (set via MPI_Comm_rank elsewhere)
int blocksize = 10000;
int blocknum = 200;
char **datablock = NULL;
char *recvblock = NULL;
MPI_Request *request;
request = (MPI_Request *)malloc(blocknum*sizeof(MPI_Request));
if(rank == 0) {
    // this is just an example; in practice those blocks are created on the fly as soon as the last block is filled
    datablock = (char**)malloc(blocknum*sizeof(char*));
    for (i=0; i<blocknum; i++) datablock[i] = (char*)malloc(blocksize*sizeof(char));
    for (i=0; i<blocknum; i++)
        MPI_Ibcast(datablock[i], blocksize, MPI_CHAR, 0, MPI_COMM_WORLD, &request[i]);
} else {
    // for this example the other ranks already know how many blocks rank 0 has created; in practice this information is broadcast via MPI before the MPI_Ibcast calls
    recvblock = (char*)malloc(blocksize*blocknum*sizeof(char));
    for (i=0; i<blocknum; i++)
        MPI_Ibcast(recvblock+i*blocksize, blocksize, MPI_CHAR, 0, MPI_COMM_WORLD, &request[i]);
}
MPI_Waitall(blocknum, request, MPI_STATUSES_IGNORE);

So an MPI_Waitall is needed at the end, and I am not sure how to use it: it takes a count, an array of requests, and an array of statuses!?

The reason I have a different MPI_Ibcast call for the root and the other ranks is that the send buffer is not identical to the receive buffer.

Another question: do I need a different request for each of the MPI_Ibcast calls in the for loops, or can I reuse the MPI_Request variable as I have done in the example above?

UPDATE 2: I have updated the example; I use an MPI_Request pointer now, which I initialize with the malloc call right after the definition. This seems pretty odd, I guess, but this is just an example, and in practice the number of requests is only known at runtime. I am especially worried whether I can use sizeof(MPI_Request) here, or whether that is problematic because MPI_Request is not a standard data type?

Apart from that, is the example correct? Is it a good solution if I want to use MPI_Ibcast?


2 Answers

1 vote
  • Would serialization be a good idea? For instance, you can copy the multiple buffers into a single one, broadcast it, and then unpack it on the receiver side. This is the way Boost.MPI handles complex objects (in C++); a rough pack-and-broadcast sketch is shown below, after the contiguous-allocation example.

  • Alternatively, you can use multiple calls to the non-blocking version of MPI_Bcast(): MPI_Ibcast(), followed by a single call to MPI_Waitall().

  • Notice that the data you are describing looks like a 2D array. There is a way to allocate it differently, so that the whole data is contiguous in memory:

    int block=42;
    int blocksize=42;
    char **array=malloc(block*sizeof(char*));
    if(array==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
    array[0]=malloc(block*blocksize*sizeof(char));
    if(array[0]==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
    int i;
    for(i=1;i<block;i++){
        array[i]=&array[0][i*blocksize];
    }
    

Then, a single call to MPI_Bcast() is sufficient to broadcast the whole array:

MPI_Bcast(array[0], block*blocksize, MPI_CHAR,0, MPI_COMM_WORLD);
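If the allocation cannot be changed, the serialization idea from the first bullet can look roughly like this (a sketch assuming BLOCKS blocks of BLOCKSIZE bytes as in the question, <string.h> included, and datablock already allocated on every rank; this is not part of the original code):

/* Pack-and-broadcast: the root copies its non-contiguous blocks into one
   contiguous buffer, a single MPI_Bcast ships it, and the other ranks
   unpack it into their own blocks afterwards. */
char *packed = malloc(BLOCKS * BLOCKSIZE);
if(packed==NULL){fprintf(stderr,"malloc failed\n");exit(1);}

if (rank == 0)
    for (i = 0; i < BLOCKS; i++)
        memcpy(packed + i * BLOCKSIZE, datablock[i], BLOCKSIZE);

MPI_Bcast(packed, BLOCKS * BLOCKSIZE, MPI_CHAR, 0, MPI_COMM_WORLD);

if (rank != 0)
    for (i = 0; i < BLOCKS; i++)
        memcpy(datablock[i], packed + i * BLOCKSIZE, BLOCKSIZE);

free(packed);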

EDIT: Here is a solution based on your code, to be compiled with mpicc main.c -o main -Wall and run with mpirun -np 4 main:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc,char *argv[])
{

    int  size, rank;
    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Comm_size(MPI_COMM_WORLD,&size);    

    int i;
    int blocksize = 10000;
    int blocknum = 200;
    char **datablock = NULL;
    char *recvblock = NULL;
    MPI_Request requests[blocknum];
    MPI_Status status[blocknum];

    if(rank == 0) {
        // this is just an example; in practice those blocks are created on the fly as soon as the last block is filled
        datablock = malloc(blocknum*sizeof(char*));
        if(datablock==NULL){fprintf(stderr,"malloc failed\n"); exit(1);}
        for (i=0; i<blocknum; i++){
            datablock[i] = (char*)malloc(blocksize*sizeof(char));
            if(datablock[i]==NULL){fprintf(stderr,"malloc failed\n"); exit(1);}
            datablock[i][0]=i%64;
        }
        for (i=0; i<blocknum; i++)
            MPI_Ibcast(datablock[i], blocksize, MPI_CHAR, 0, MPI_COMM_WORLD, &requests[i]);


    } else {
        // for this example the other ranks already know how many blocks rank 0 has created; in practice this information is broadcast via MPI before the MPI_Ibcast calls
        recvblock = malloc(blocksize*blocknum*sizeof(char));
        if(recvblock==NULL){fprintf(stderr,"malloc failed\n"); exit(1);}
        for (i=0; i<blocknum; i++)
            MPI_Ibcast(recvblock+i*(blocksize), blocksize, MPI_CHAR, 0, MPI_COMM_WORLD, &requests[i]);
    }

    int ierr=MPI_Waitall(blocknum, requests, status); 
    if(ierr!=MPI_SUCCESS){fprintf(stderr,"MPI_Waitall() failed rank %d\n",rank);exit(1);}


    if(rank==0){
        for(i=0;i<blocknum;i++){
            free(datablock[i]);
        }
        free(datablock);
    }else{
        for(i=0;i<blocknum;i++){
            if(recvblock[i*(blocksize)]!=i%64){
                printf("communcation problem ! %d %d %d\n",rank,i, recvblock[i*(blocksize)]);
            }
        }
        free(recvblock);
    }

    MPI_Finalize();
    return 0;
}

I believe that an optimal implementation would be a mix between serialization and MPI_Ibcast(), to limit both the memory footprint and the number of messages.

0 votes

You can use multiple bcasts (and using Ibcast might be a good idea then), but if you want to try sending everything in one go, check out the MPI_Type_hindexed derived datatype.
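A rough sketch of that approach, assuming the block layout from the question and using MPI_Type_create_hindexed (the current spelling of the deprecated MPI_Type_hindexed); each rank builds the datatype over the addresses of its own, already allocated blocks, so only the type signatures have to match across ranks:

/* Describe the non-contiguous blocks with absolute byte displacements and
   broadcast MPI_BOTTOM with that datatype in a single call. */
MPI_Aint displs[BLOCKS];
int lens[BLOCKS];
MPI_Datatype blocks_type;
for (i = 0; i < BLOCKS; i++) {
    MPI_Get_address(datablock[i], &displs[i]);
    lens[i] = BLOCKSIZE;
}
MPI_Type_create_hindexed(BLOCKS, lens, displs, MPI_CHAR, &blocks_type);
MPI_Type_commit(&blocks_type);
MPI_Bcast(MPI_BOTTOM, 1, blocks_type, 0, MPI_COMM_WORLD);
MPI_Type_free(&blocks_type);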