0
votes

I am writing an MPI program and MPI_Bcast is very slow on one particular machine I am using. To narrow down the problem, I wrote the following two test programs. The first does many MPI_Send/MPI_Recv operations from process 0 to the others:

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

#define N 1000000000

int main(int argc, char** argv) {
  int rank, size;

  /* initialize MPI */
  MPI_Init(&argc, &argv);

  /* get the rank (process id) and size (number of processes) */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* have process 0 do many sends */
  if (rank == 0) {
    int i, j;
    for (i = 0; i < N; i++) {
      for (j = 1; j < size; j++) {
        if (MPI_Send(&i, 1, MPI_INT, j, 0, MPI_COMM_WORLD) != MPI_SUCCESS) {
          printf("Error!\n");
          exit(0);
        }   
      }   
    }   
  }   

  /* have the rest receive that many values */
  else {
    int i;
    for (i = 0; i < N; i++) {
      int value;
      if (MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE) != MPI_SUCCESS) {
        printf("Error!\n");
        exit(0);
      }   
    }   
  }   

  /* quit MPI */
  MPI_Finalize( );
  return 0;
}

This program runs in only 2.7 seconds or so with 4 processes.

This next program does exactly the same thing, except it uses MPI_Bcast to send the values from process 0 to the other processes:

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

#define N 1000000000

int main(int argc, char** argv) {
  int rank, size;

  /* initialize MPI */
  MPI_Init(&argc, &argv);

  /* get the rank (process id) and size (number of processes) */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* have process 0 do many sends */
  if (rank == 0) {
    int i, j;
    for (i = 0; i < N; i++) {
      if (MPI_Bcast(&i, 1, MPI_INT, 0, MPI_COMM_WORLD) != MPI_SUCCESS) {
        printf("FAIL\n");
        exit(0);
      }   
    }   
  }   

  /* have the rest receive that many values */
  else {
    int i;
    for (i = 0; i < N; i++) {
      if (MPI_Bcast(&i, 1, MPI_INT, 0, MPI_COMM_WORLD) != MPI_SUCCESS) {
        printf("FAIL\n");
        exit(0);
      }   
    }   
  }   

  /* quit MPI */
  MPI_Finalize( );
  return 0;
}

Both programs use the same value of N, and neither returns an error from its communication calls. The second program should be at least a little bit faster. But it is not: it is much slower, at roughly 34 seconds - around 12X slower!

This problem manifests itself on only one machine, not on the others, even though they are running the same operating system (Ubuntu) and don't have drastically different hardware. I'm using OpenMPI on all of them.

I'm really pulling my hair out, does anyone have an idea?

Thanks for reading!


1 Answer

2
votes

A couple of observations.

The MPI_Bcast is receiving the result into the "&i" buffer, while the MPI_Recv is receiving it into "&value". Is there some reason that decision was made?
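Since MPI_Bcast is a collective that every rank calls identically, one way to avoid broadcasting into the loop counter is to give all ranks a separate payload buffer. A minimal sketch of that structure (same communicator as the original; N reduced 100x here as in the tests below; it needs an MPI environment and mpirun to execute):

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 10000000  /* reduced 100x from the original */

int main(int argc, char** argv) {
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int i;
  for (i = 0; i < N; i++) {
    /* root seeds the payload with its counter; on the other
       ranks the broadcast overwrites value, not the counter i */
    int value = i;
    if (MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD) != MPI_SUCCESS) {
      printf("FAIL\n");
      exit(0);
    }
  }

  MPI_Finalize();
  return 0;
}
```

Note that because the broadcast is collective, the root/non-root branches of the original program collapse into a single loop here.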

The Send/Recv model will naturally synchronize. The MPI_Send calls are blocking and serialized, and the matching MPI_Recv should always be ready by the time MPI_Send is called.

In general, collectives tend to have larger advantages as the job size scales up.

I compiled and ran the programs using IBM Platform MPI. To speed up the testing, I lowered N by 100x, to 10 million. I changed the MPI_Bcast to receive the result into a "&value" buffer rather than into the "&i" buffer. I ran each case three times and averaged the times. The times are the "real" value reported by "time" (this was necessary as the ranks were running remotely from the mpirun command).
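The measurements were taken roughly like this (binary names are placeholders of my own; the "real" line of time's output is what was averaged, since per-rank timers aren't visible when the ranks run remotely from mpirun):

```shell
# placeholder binary names; run each 3x and average the "real" times
time mpirun -np 4 ./send_recv_test
time mpirun -np 4 ./bcast_test
```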

With 4 ranks over shared memory, the Send/Recv model took 6.5 seconds, the Bcast model took 7.6 seconds.

With 32 ranks (8/node x 4 nodes, FDR InfiniBand), the Send/Recv model took 79 seconds, the Bcast model took 22 seconds.

With 128 ranks (16/node x 8 nodes, FDR Infiniband), the Send/Recv model took 134 seconds, the Bcast model took 44 seconds.

Given these timings AFTER the reduction of N by 100x to 10000000, I am going to suggest that the "2.7 second" time was effectively a no-op. Double check that some actual work was done.