I am writing an MPI program, and MPI_Bcast is very slow on one particular machine I am using. To narrow down the problem, I wrote the following two test programs. The first does many MPI_Send/MPI_Recv operations from process 0 to the others:
    #include <stdlib.h>
    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000000

    int main(int argc, char** argv) {
        int rank, size;

        /* initialize MPI */
        MPI_Init(&argc, &argv);

        /* get the rank (process id) and size (number of processes) */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* have process 0 do many sends */
        if (rank == 0) {
            int i, j;
            for (i = 0; i < N; i++) {
                for (j = 1; j < size; j++) {
                    if (MPI_Send(&i, 1, MPI_INT, j, 0, MPI_COMM_WORLD) != MPI_SUCCESS) {
                        printf("Error!\n");
                        exit(0);
                    }
                }
            }
        }
        /* have the rest receive that many values */
        else {
            int i;
            for (i = 0; i < N; i++) {
                int value;
                if (MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE) != MPI_SUCCESS) {
                    printf("Error!\n");
                    exit(0);
                }
            }
        }

        /* quit MPI */
        MPI_Finalize();
        return 0;
    }
This program runs in only 2.7 seconds or so with 4 processes.
This next program does exactly the same thing, except it uses MPI_Bcast to send the values from process 0 to the other processes:
    #include <stdlib.h>
    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000000

    int main(int argc, char** argv) {
        int rank, size;

        /* initialize MPI */
        MPI_Init(&argc, &argv);

        /* get the rank (process id) and size (number of processes) */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* have process 0 broadcast many values (it is the root) */
        if (rank == 0) {
            int i;
            for (i = 0; i < N; i++) {
                if (MPI_Bcast(&i, 1, MPI_INT, 0, MPI_COMM_WORLD) != MPI_SUCCESS) {
                    printf("FAIL\n");
                    exit(0);
                }
            }
        }
        /* have the rest receive that many values via the same broadcast call */
        else {
            int i;
            for (i = 0; i < N; i++) {
                if (MPI_Bcast(&i, 1, MPI_INT, 0, MPI_COMM_WORLD) != MPI_SUCCESS) {
                    printf("FAIL\n");
                    exit(0);
                }
            }
        }

        /* quit MPI */
        MPI_Finalize();
        return 0;
    }
Both programs use the same value of N, and neither gets an error back from the communication calls. I would expect the second program to be at least a little faster. Instead, it is much slower at roughly 34 seconds, around 12X slower!
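For reference, here is a rough sketch of the arithmetic behind those figures, assuming both timings are wall-clock totals for the whole run. The helper names `ns_per_iteration` and `slowdown` are made up for illustration and are not part of either test program:

```c
/* Average cost of one outer-loop iteration, in nanoseconds.
   total_seconds: wall-clock time of the whole run (assumed)
   iterations:    number of loop iterations, i.e. N */
double ns_per_iteration(double total_seconds, double iterations) {
    return total_seconds * 1e9 / iterations;
}

/* How many times slower the slow run is compared to the fast one. */
double slowdown(double fast_seconds, double slow_seconds) {
    return slow_seconds / fast_seconds;
}
```

With the numbers above, `slowdown(2.7, 34.0)` comes out at about 12.6, and `ns_per_iteration` gives about 2.7 ns per iteration for the send/receive version versus about 34 ns per iteration for the broadcast version.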
This problem only shows up on one machine and not on the others, even though they are running the same operating system (Ubuntu) and don't have drastically different hardware. Also, I'm using Open MPI on both.
I'm really pulling my hair out. Does anyone have an idea?
Thanks for reading!