
I am trying to run a code using hybrid MPI-OpenMP parallelization. As far as I know, as long as the number of OpenMP threads does not exceed the number of physical processors, each processor runs one thread. Assuming this is true, suppose I have a hypothetical computing node consisting of two computing cards. Each computing card has chips with 4 processors + memory. My question is: what would be the optimal choice of MPI and OpenMP parameters? I would say 2 MPI processes with 4 threads each. Is this correct?

OMP_NUM_THREADS=4 mpirun -np 2 code

I heard from some colleagues that those parameters should be chosen carefully, depending on the hardware layout, to get the best performance. I would appreciate some advice on running hybrid jobs.

Thanks


1 Answer


The choice of the correct parallelization configuration for a real application code is never trivial. The optimal mapping of MPI processes and OpenMP threads onto a multiprocessor node depends on the specific implementation of the algorithm, the OpenMP runtime, the internal organization of the cache memory hierarchy and other factors related to the processor architecture.

Therefore, users are advised to run different configurations on their specific hardware to find the optimal assignment. You can find a number of reports on such studies among the technical reports of research computing facilities and HPC consultancies.

On an m x n node, where m is the number of processor sockets and n is the number of CPU cores per socket, such an experiment would involve running the code for all possible integer values of the number of MPI processes p and the number of OpenMP threads per process q such that p x q = m x n, for each available compiler.
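As a rough sketch of that sweep, the following Python snippet lists every (p, q) pair with p x q = m x n and prints a launch line for each. The 2 x 4 node is the one from the question; the function names and the `./code` binary path are placeholders of mine, and process/thread binding options are left out on purpose because they differ between MPI implementations. Some launchers also need the environment variable exported to remote ranks explicitly, so treat the printed lines as a starting point.

```python
def hybrid_configs(sockets, cores_per_socket):
    """Return every (p, q) with p * q == sockets * cores_per_socket.

    p is the number of MPI processes, q the OpenMP threads per process.
    """
    total = sockets * cores_per_socket
    return [(p, total // p) for p in range(1, total + 1) if total % p == 0]


def launch_command(p, q, binary="./code"):
    """Build a launch line for p MPI processes with q threads each.

    "./code" is a placeholder for the real executable.
    """
    return f"OMP_NUM_THREADS={q} mpirun -np {p} {binary}"


# The hypothetical node from the question: 2 sockets with 4 cores each.
for p, q in hybrid_configs(2, 4):
    print(launch_command(p, q))
```

Each printed line is one configuration to time; repeating every run a few times helps separate real differences from noise.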

Here is a plot of the parallel speedup obtained for different combinations of p and q on a 4 x 12 AMD Opteron node.

[Figure: Parallel speedup for different numbers of MPI processes and OpenMP threads. Data taken from HiPERiSM Consulting LLC technical report HCTR-2011-2 by George Delic, 2010.]

You can see that for this particular code and processor architecture the optimal number of OpenMP threads per MPI process is 1. However, the case of 4 threads and 12 MPI processes came a close second.
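To produce the same kind of ranking from your own measurements, a minimal sketch could look like this. It assumes you have recorded one wall-clock time per (p, q) configuration plus a baseline time, for example from a run with one process and one thread. The function name and the input layout are my own choices, not something taken from the report.

```python
def rank_configs(wall_times, baseline):
    """Rank (p, q) configurations by parallel speedup, best first.

    wall_times maps (p, q) -> measured wall-clock time in seconds.
    baseline is the reference time, e.g. a 1-process, 1-thread run.
    Speedup is computed as baseline / wall_time.
    """
    ranked = [(cfg, baseline / t) for cfg, t in wall_times.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

Calling `rank_configs` with a dictionary keyed by (p, q) returns a list of `((p, q), speedup)` tuples, so the first entry is the fastest configuration and near-ties like the one mentioned above are easy to spot.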