
System configuration:

Workstation with two Xeon E5-2620 v4 CPUs, running CentOS 7.3.

Open MPI 3.0.1, ifort 2015, gcc 4.8.6, Intel MKL.

I run an MPI/OpenMP hybrid program on this workstation and want to use 1 MPI process with 8 OpenMP threads. However, the number of OpenMP threads actually used in the parallel region is always 1. On another machine with an Intel i7-9900K CPU, the number of OpenMP threads is always 2. On both machines I have printed OMP_NUM_THREADS by calling omp_get_max_threads; it is 8, since I already set "export OMP_NUM_THREADS=8". It was really puzzling.
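For reference, the run was launched roughly like this (the program name here is just a placeholder):

    export OMP_NUM_THREADS=8
    mpirun -n 1 ./hybrid_program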

After digging for about one day, I realized it was related to the Open MPI option "-bind-to". If "-bind-to none" or "-bind-to numa" is used, the program works fine: the CPU usage of each MPI process is 800% and an 8x speedup is obtained in the parallel region. If I use the default, which is "-bind-to core", the number of OpenMP threads is never what I expect. On the workstation with the Xeon E5-2620 v4 CPUs, hyper-threading is disabled, so only 1 OpenMP thread is actually used. On the PC with the Intel i7-9900K, hyper-threading is enabled, so 2 OpenMP threads are used.
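For example, with the same placeholder program name, the working launch looks like this:

    mpirun -n 1 -bind-to none ./hybrid_program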

Moreover, if I do not use "-bind-to none/numa", omp_get_num_procs returns 1. With "-bind-to none", omp_get_num_procs returns the total number of CPU cores, while with "-bind-to numa" it returns the number of CPU cores in one CPU.
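A minimal test program that prints these values inside one MPI rank (just a sketch for illustration, not my actual code; file name is hypothetical, compiled e.g. with mpicc -fopenmp check_omp.c -o check_omp) could look like this:

    /* check_omp.c: print what the OpenMP runtime sees inside an MPI rank. */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Value derived from OMP_NUM_THREADS. */
        printf("rank %d: omp_get_max_threads() = %d\n", rank, omp_get_max_threads());
        /* Processors the OpenMP runtime sees; affected by the mpirun binding. */
        printf("rank %d: omp_get_num_procs()   = %d\n", rank, omp_get_num_procs());

        #pragma omp parallel
        {
            #pragma omp single
            printf("rank %d: threads in parallel region = %d\n",
                   rank, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }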

I post my experience here in the hope that it is helpful for other people who run into a similar problem.

I'm voting to close this question as off-topic because this is not a question but a blog post. – Gilles Gouaillardet
Binding MPI and OpenMP is a two-step tango: bind the MPI tasks to a set of cores, and then bind OpenMP within this subset. When running 2 tasks (or fewer, IIRC) with Open MPI, the default is to bind each task to a core. When running more tasks, the default is to bind each task to a NUMA domain. I am a bit surprised, though, that you ended up with one OpenMP thread instead of the 8 requested OpenMP threads (which would time-share if bound to a single core). – Gilles Gouaillardet
"I am a bit surprised though that you ended with one OpenMP threads instead of the 8 requested OpenMP threads (that would do time sharing if bound on a single core). " I think you are right. It should be 8 threads sharing 1 core. For me, I just print OMP_NUM_THREADS before the parallel region. This number is 8. I only know the CPU usage is 100%.Bingbing

1 Answer


As Gilles pointed out in the comments, there are two places in a hybrid MPI + OpenMP run where CPU sets are handled. The first is the MPI library and the mpirun (mpiexec) program, which distributes MPI processes over the available nodes and their available CPUs (with the help of hwloc). Every started MPI process then has some allowed set of logical CPU cores to work on, and the OpenMP (multithreading) library will try to work within these available resources. The OpenMP library (gcc's libgomp, or Intel's and LLVM's openmprtl.org) may check the set of allowed cores to decide how many threads to use.
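As an illustration (a Linux-specific sketch of my own, not part of any of these libraries; the file name is hypothetical), each rank can print the size of its allowed CPU set, which is what the OpenMP runtime will typically base its thread count on:

    /* affinity_check.c: print the CPU affinity mask that mpirun
     * gave to this MPI process. Linux/glibc only. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cpu_set_t mask;
        if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
            printf("rank %d: allowed logical CPUs = %d\n", rank, CPU_COUNT(&mask));

        MPI_Finalize();
        return 0;
    }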

When you want "N MPI processes with K OpenMP threads each", you should check that your mpirun gives K allowed cores to every MPI process; mpirun will not respect your OMP_NUM_THREADS environment variable.

Open MPI's mpirun has the useful option --report-bindings to see what the allowed set is in an actual run. It also has many (not so easy to use) options to change the bindings, described in the man page: https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php

You can try mpirun -n 1 --report-bindings true to see the actual bindings without starting your program. For 1 MPI process and 8 cores per process, try the option "--cpus-per-proc <#perproc> - Bind each process to the specified number of cpus":

 mpirun -n 1 --cpus-per-proc 8 --report-bindings true

This option is deprecated, but it may still work, and it is much easier to use than "--map-by <obj>:PE=n". (I don't fully understand that variant at the moment. Probably --map-by socket or --bind-to socket may help you bind the MPI process to all available cores of one CPU chip.)
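For illustration only (I have not verified this exact form), an invocation of that kind might look like:

    mpirun -n 1 --map-by socket:PE=8 --report-bindings true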

To find out the mapping of logical CPU cores to hyper-threads and physical CPU cores, you may use the hwloc library and its lstopo tool: https://www.open-mpi.org/projects/hwloc/doc/v2.0.4/a00312.php#cli_examples
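For example, a plain-text view of the topology with physical indexes can be printed with:

    lstopo -p --of console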