
I want to call a library which uses OpenMP parallelization from within a program which itself runs in parallel via MPI. If I run my MPI program with a single process, then when the OpenMP library is called, 7 additional threads (matching the number of cores on my machine) are spawned correctly and the task is carried out in parallel. If I instead run my MPI program with 2 processes and let each process call the OpenMP library, each of them spawns its own set of threads rather than cooperating as before, and the computation takes much longer.

I have tried letting only the MPI master process call the OpenMP library while the other process(es) wait, but then those processes (and their physical cores) do not participate in the OpenMP computation at all.
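A minimal sketch of that attempt, assuming mpi4py and with call_openmp_lib standing in for the Cython wrapper around the library, might look like this:

from mpi4py import MPI

comm = MPI.COMM_WORLD
master = (comm.rank == 0)

if master:
    call_openmp_lib()  # only the master calls the OpenMP library
comm.Barrier()         # the other processes wait here, busy at 100% CPU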

Do I have to somehow tell the MPI program that it should now launch the OpenMP program collectively? A further complication is that I run the MPI program on a cluster with multiple nodes. It would be acceptable to only launch the OpenMP program on the node which contains the MPI master process.

To be specific, my MPI program is written in Cython and uses mpi4py. I use MPICH as the MPI implementation, but hopefully this is not important. The OpenMP program is written in C and I call it through a Cython wrapper.
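Regarding the multi-node case: a minimal sketch, assuming mpi4py on top of an MPI-3 implementation, of how one might find out which ranks share a node with the master process:

from mpi4py import MPI

comm = MPI.COMM_WORLD

# Group the ranks by the node they run on (MPI-3 shared-memory split)
node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED, key=comm.rank)

# Check whether this rank is on the same node as the global master
master_node = comm.bcast(MPI.Get_processor_name(), root=0)
on_master_node = (MPI.Get_processor_name() == master_node)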

Your question is confusing. The title mentions "OpenMP program", but from the text it would appear that you are calling library functions that use OpenMP. Please clarify as both are very different things. - Hristo Iliev
Your choice of MPI is important, as the more widely used choices of MPI (at least on linux) provide automatic means for pinning OpenMP threads to separate cores. Otherwise, your OpenMP processes probably need to run on separate nodes, unless you go to the trouble of writing scripts to set affinities and use an OpenMP which supports affinity. - tim18
@HristoIliev I guess I am using a library that uses OpenMP then. - jmd_dk

1 Answer


I found the solution.

The call to the OpenMP library should only be made by a single MPI process. It is no good to follow this call with a standard MPI barrier, as such a barrier takes up 100% of the CPU time on the slave processes, leaving no spare cores for the OpenMP threads. Instead we have to write our own barrier function, in which the slave processes periodically probe for a message from the master announcing that the OpenMP call has completed. Between two such probes, the slave processes sleep for a given interval of time, which means that their cores are free to participate in the OpenMP computation.

An example of this logic, implemented in Python with mpi4py, is given below.

from time import sleep
from mpi4py import MPI

comm = MPI.COMM_WORLD
nprocs = comm.size
master = (comm.rank == 0)

def sleeping_barrier(sleep_time=0.1):
    if master:
        # Signal the slaves that the OpenMP call has completed
        for slave in range(1, nprocs):
            comm.isend(True, dest=slave)
    else:
        # Wait for the signal from the master, sleeping between
        # probes so that this core stays free for the OpenMP threads
        while True:
            sleep(sleep_time)
            if comm.iprobe(source=0):
                comm.recv(source=0)  # Remember to receive the message
                break

# Do the OpenMP library call on the master only
# (call_openmp_lib is the Cython wrapper around the library)
if master:
    call_openmp_lib()
sleeping_barrier()