1 vote

I'm using C++ with MPI to perform some linear algebra calculations, such as eigenvalue decomposition. These calculations are completely local to each process, so I expected the single-process performance not to be influenced by the total number of processes I run, as long as there are enough computational resources.

However, it turns out that as the total number of processes increases, the performance of each process decreases. On a node consisting of 2 Intel Xeon Gold 6132 CPUs (28 physical cores, or 56 threads, in total), my tests show that the eigen-decomposition of a 2000-by-2000 symmetric matrix takes about 1.1 seconds for a single process, 1.3 seconds per process for 4 independent processes (with mpirun -np 4 ./test), and 1.8 seconds per process for 12 processes.

I wonder: is this expected behaviour for MPI, or did I miss some binding options? I've tried "mpirun -np 12 --bind-to core:12 ./test" but it does not help. I'm using the Armadillo library, linked against Intel MKL, and the environment variable MKL_NUM_THREADS is set to 1. The source code is attached.

#include <mpi.h>
#include <armadillo>
#include <chrono>
#include <iostream>
#include <sstream>

using namespace arma;
using iclock = std::chrono::high_resolution_clock;

int main(int, char**argv) {

    ////////////////////////////////////////////////////
    //              MPI Initialization
    ////////////////////////////////////////////////////
    int id, nprocs;
    MPI_Init(nullptr, nullptr);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    ////////////////////////////////////////////////////
    //              parse arguments
    ////////////////////////////////////////////////////
    int sz = 0, nt = 0;
    std::stringstream ss; 

    if (id == 0) {
        ss << argv[1];
        ss >> sz; 
        ss.clear();
        ss.str("");

        ss << argv[2];
        ss >> nt; 
        ss.clear();
        ss.str("");
    }   

    MPI_Bcast(&sz, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&nt, 1, MPI_INT, 0, MPI_COMM_WORLD);

    ////////////////////////////////////////////////////
    //                test and timing
    ////////////////////////////////////////////////////
    mat a = randu(sz, sz);
    a += a.t();

    mat evec(sz, sz);
    vec eval(sz);

    iclock::time_point start = iclock::now();

    for (int i = 0; i != nt; ++i) {
        //evec = a*a;
        eig_sym(eval, evec, a); // <-------here
    }   

    std::chrono::duration<double> dur = iclock::now() - start;

    double t = dur.count() / nt;

    ////////////////////////////////////////////////////
    //               collect timing
    ////////////////////////////////////////////////////
    vec durs(nprocs);
    MPI_Gather(&t, 1, MPI_DOUBLE, durs.memptr(), 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (id == 0) {
        std::cout << "average time elapsed of each proc:" << std::endl;
        durs.print();
    }

    MPI_Finalize();

    return 0;
}
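
For reference, this is roughly how I build and run the test. The exact link line depends on how Armadillo was built against MKL on the machine, so take it as a sketch; the two trailing arguments are the matrix size (argv[1]) and the number of repetitions (argv[2]) that the program reads:

mpicxx -O2 -std=c++11 test.cpp -o test -larmadillo
export MKL_NUM_THREADS=1
mpirun -np 12 --map-by core --bind-to core --report-bindings ./test 2000 10

(--map-by core, --bind-to core and --report-bindings are Open MPI options; --report-bindings prints where each rank is pinned, which lets me check that the ranks do not share cores.)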

You should provide the value of sz that you use when running the tests. Computational resources are not the only thing shared by multiple processes: last-level caches and memory bandwidth are limited too. – Hristo Iliev
Building on what @HristoIliev says, Armadillo is likely using bindings to LAPACK and/or BLAS for the actual linear algebra, which will use knowledge about the cache sizes of your machine to improve memory throughput. Running multiple processes in parallel means you have much more cache contention, and lower overall throughput. – bnaecker

2 Answers

0 votes

This is expected behaviour. To get performance out of MPI, you have to perform data decomposition (load balancing), communication optimisation (blocking vs. non-blocking), and so on.
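
For illustration only (this is not the asker's code): a minimal sketch of what row-block data decomposition looks like for a simpler kernel, a matrix-vector product, using plain MPI and assuming the dimension n divides evenly by the number of ranks:

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 2000;            // global dimension (assumed divisible by nprocs)
    const int rows = n / nprocs;   // rows owned by each rank

    std::vector<double> A;                 // full matrix, row-major, only on rank 0
    std::vector<double> x(n, 1.0);         // input vector, replicated on every rank
    if (rank == 0) A.assign(std::size_t(n) * n, 1.0);

    // Data decomposition: each rank receives a contiguous block of rows.
    std::vector<double> Aloc(std::size_t(rows) * n);
    MPI_Scatter(rank == 0 ? A.data() : nullptr, rows * n, MPI_DOUBLE,
                Aloc.data(), rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // Local part of y = A * x.
    std::vector<double> yloc(rows, 0.0);
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < n; ++j)
            yloc[i] += Aloc[std::size_t(i) * n + j] * x[j];

    // Reassemble the distributed result on rank 0.
    std::vector<double> y(rank == 0 ? n : 0);
    MPI_Gather(yloc.data(), rows, MPI_DOUBLE,
               rank == 0 ? y.data() : nullptr, rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) std::printf("y[0] = %g (expected %d)\n", y[0], n);
    MPI_Finalize();
    return 0;
}

A distributed eigensolver is far more involved than this (that is what libraries such as ScaLAPACK exist for), but the decomposition-and-communication pattern is the same idea.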

As I understand the question, 12 processes each perform the calculation on their own 2000x2000 matrix and the average time is 1.8 seconds, while a single process performing the same computation takes only about 1.1 seconds.

Yes, for the above scenario MPI will not beat a single process; the per-process time will be higher, for the following reasons (some of them are mentioned by Hristo Iliev in the comments):

  1. Overhead induced by MPI itself.
  2. The time taken by the slowest process in the MPI job.
  3. Memory bandwidth: every process streams its own 2000x2000 matrix through the same memory controllers, which can result in contention (a minimal way to observe this is sketched after this list).
  4. Caching: the last-level caches are shared among cores, so frequent accesses by multiple processes can evict each other's data and hurt overall application performance.
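
The sketch below is just an illustration, not the asker's code: each rank runs the same streaming kernel on its own private arrays, so any slowdown when increasing -np comes purely from sharing memory bandwidth and caches, exactly as with the independent eig_sym calls.

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Three arrays of 10 million doubles (~240 MB per rank): far larger than
    // the shared last-level cache, so the loop below is bandwidth-bound.
    const std::size_t n = 10000000;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);

    double t0 = MPI_Wtime();
    for (int rep = 0; rep < 20; ++rep)
        for (std::size_t i = 0; i < n; ++i)
            c[i] = a[i] + 3.0 * b[i];        // STREAM-triad-like kernel
    double local = MPI_Wtime() - t0;

    // Report the slowest rank; the gap to a single-rank run shows contention.
    double worst = 0.0;
    MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("np = %d, worst per-rank time = %.3f s (c[0] = %g)\n",
                    nprocs, worst, c[0]);
    MPI_Finalize();
    return 0;
}

Comparing the per-rank time of this kernel for mpirun -np 1 and mpirun -np 12 gives a rough idea of how much of the eig_sym slowdown is due to memory traffic rather than MPI itself.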

Also, any performance improvement (speedup) depends on the fraction of the application that is actually parallelised, and since there is no parallel part in your application (each process repeats the full computation), you won't observe any benefit from the parallelisation.
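
To put a number on that: Amdahl's law bounds the speedup on N processes by S(N) = 1 / ((1 - p) + p / N), where p is the fraction of the work that is parallelised. Here each process repeats the whole computation, so effectively p = 0 and S(N) = 1 regardless of N; the only thing that grows with N is the contention described above.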

Also, even if a single 2000x2000 matrix were distributed among 12 processes, we cannot guarantee that the MPI version would outperform a single process; it depends on the implementation.

0 votes

Are you dividing your total elapsed time by the number of processes, or are you accounting for scheduling overhead? The scheduler in your runtime needs some overhead processing time, and that overhead grows with the ratio of the number of processes to the number of cores on your machine. You may need to reduce your parallelism granularity (the number of processes per processor) in order to optimise speed. This is expected behaviour under normal conditions.
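
One way to separate the per-rank compute time from whole-job effects is sketched below; it reuses the t and id variables from the question's code and could go right after the existing MPI_Gather:

double t_max = 0.0, t_min = 0.0;
// Max/min across ranks: a large gap between the two indicates that some
// ranks were delayed by scheduling or resource contention.
MPI_Reduce(&t, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
MPI_Reduce(&t, &t_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
if (id == 0)
    std::cout << "slowest rank: " << t_max << " s, fastest rank: " << t_min << " s\n";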

However, the conditions you have set are not normal. Setting MKL_NUM_THREADS=1 prevents MKL from spawning more than one thread! Delete the line that sets MKL_NUM_THREADS and the system will take care of it for you.