I want to call a library that uses OpenMP parallelization from within a program that is itself parallelized via MPI. If I run my MPI program with a single process, then when the time comes to call out to the OpenMP library, 7 additional threads (matching the number of cores on my machine) are spawned correctly and the task is carried out in parallel. If I instead run my MPI program with 2 processes and let each process call out to the OpenMP library, each process spawns its own full team of threads rather than sharing the work, oversubscribing the cores and making the computation take much longer.
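Roughly, the structure is as follows (a minimal sketch; `omp_wrapper` and `compute_omp` are placeholder names for my actual Cython wrapper and its entry point):

```
from mpi4py import MPI
import numpy as np
from omp_wrapper import compute_omp  # placeholder for the real Cython wrapper

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
data = np.linspace(0.0, 1.0, 10**6)  # stand-in for the real input

# ... the purely MPI-parallel part of the program ...

# Every MPI process that reaches this call spawns its own OpenMP
# thread team, so with 2 processes two full teams run at once and
# the cores are oversubscribed.
result = compute_omp(data)
```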
I have tried letting only the MPI master process call the OpenMP library while the other process(es) wait, but then the cores occupied by those waiting processes do not participate in the OpenMP computation at all.
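What I tried looks roughly like this (again a sketch, with the same placeholder names):

```
from mpi4py import MPI
import numpy as np
from omp_wrapper import compute_omp  # placeholder for the real Cython wrapper

comm = MPI.COMM_WORLD
data = np.linspace(0.0, 1.0, 10**6)

if comm.Get_rank() == 0:
    # Only the master rank calls into the OpenMP library.
    result = compute_omp(data)
else:
    result = None

# The remaining ranks simply wait here; the cores they occupy
# contribute nothing to the OpenMP computation.
result = comm.bcast(result, root=0)
```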
Do I have to somehow tell the MPI program that it should now launch the OpenMP program collectively? A further complication is that I run the MPI program on a cluster with multiple nodes. It would be acceptable to launch the OpenMP program only on the node which contains the MPI master process.
To be specific, my MPI program is written in Cython and uses mpi4py. I use MPICH as the MPI implementation, though hopefully this is not important. The OpenMP library is written in C and I call it through a Cython wrapper.
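For completeness, the wrapper is essentially just this (a sketch; `heavy_computation` and the header name are placeholders, since I don't think the actual signature matters here):

```
# omp_wrapper.pyx -- sketch of the Cython wrapper around the C library
import numpy as np

cdef extern from "omp_lib.h":
    void heavy_computation(double* data, int n) nogil

def compute_omp(double[::1] data):
    # Release the GIL so the C library is free to spawn its OpenMP threads
    with nogil:
        heavy_computation(&data[0], <int>data.shape[0])
    return np.asarray(data)
```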