I'm using C++ with MPI to perform some linear algebra calculations, such as eigenvalue decomposition. These calculations are completely local to each process, so I expected that the performance of a single process would not be affected by the total number of processes I run, as long as there are enough computational resources.
However, it turns out that as the total number of processes increases, the performance of each process decreases. On a node with two Intel Xeon Gold 6132 CPUs (28 physical cores, 56 hardware threads in total), my tests show that the eigen-decomposition of a 2000-by-2000 symmetric matrix takes about 1.1 seconds with a single process, about 1.3 seconds per process with 4 independent processes (with mpirun -np 4 ./test), and about 1.8 seconds per process with 12 processes.
I wonder: is this expected behavior for MPI, or have I missed some binding options? I've tried "mpirun -np 12 --bind-to core:12 ./test", but it does not help. I'm using the Armadillo library linked against Intel MKL, and the environment variable MKL_NUM_THREADS is set to 1. The source code is attached below; ./test takes the matrix size and the number of repetitions as command-line arguments.
#include <mpi.h>
#include <armadillo>
#include <chrono>
#include <iostream>
#include <sstream>

using namespace arma;
using iclock = std::chrono::high_resolution_clock;
int main(int, char** argv) {

    ////////////////////////////////////////////////////
    // MPI Initialization
    ////////////////////////////////////////////////////
    int id, nprocs;
    MPI_Init(nullptr, nullptr);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    ////////////////////////////////////////////////////
    // parse arguments
    ////////////////////////////////////////////////////
    int sz = 0, nt = 0;
    std::stringstream ss;
    if (id == 0) {
        ss << argv[1];
        ss >> sz;
        ss.clear();
        ss.str("");
        ss << argv[2];
        ss >> nt;
        ss.clear();
        ss.str("");
    }
    MPI_Bcast(&sz, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&nt, 1, MPI_INT, 0, MPI_COMM_WORLD);

    ////////////////////////////////////////////////////
    // test and timing
    ////////////////////////////////////////////////////
    mat a = randu(sz, sz);
    a += a.t();
    mat evec(sz, sz);
    vec eval(sz);

    iclock::time_point start = iclock::now();
    for (int i = 0; i != nt; ++i) {
        //evec = a*a;
        eig_sym(eval, evec, a); // <-------here
    }
    std::chrono::duration<double> dur = iclock::now() - start;
    double t = dur.count() / nt;

    ////////////////////////////////////////////////////
    // collect timing
    ////////////////////////////////////////////////////
    vec durs(nprocs);
    MPI_Gather(&t, 1, MPI_DOUBLE, durs.memptr(), 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (id == 0) {
        std::cout << "average time elapsed of each proc:" << std::endl;
        durs.print();
    }

    MPI_Finalize();
    return 0;
}
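To check whether the ranks are actually pinned to distinct cores, here is a minimal diagnostic sketch (assuming Linux, where sched_getcpu() is available; with Open MPI, mpirun's --report-bindings option prints similar information at launch). It only reports where each rank is currently running, so it is meant as a sanity check rather than a fix.

#include <mpi.h>
#include <sched.h>   // sched_getcpu(), a GNU/Linux extension (g++ defines _GNU_SOURCE by default)
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int id, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    // Each rank reports the logical CPU it is currently scheduled on.
    // With proper binding, repeated runs should show each rank staying on its own core.
    std::printf("rank %d of %d on %s, logical CPU %d\n", id, nprocs, host, sched_getcpu());

    MPI_Finalize();
    return 0;
}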
It depends on the value of sz that you use when running the tests. Computational resources are not the only ones shared by multiple processes. Last-level caches and memory bandwidth are limited too. – Hristo Iliev
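For scale: a 2000-by-2000 matrix of doubles occupies 2000 × 2000 × 8 B = 32 MB, and each process holds at least a and evec plus LAPACK workspace, so the per-process working set is far larger than the roughly 19 MB of shared L3 cache per Xeon Gold 6132 socket. With 12 independent ranks, the runs would then be competing mostly for memory bandwidth rather than for cores, which could account for the slowdown.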