Splitting LAPACK calls with OpenMP

Question

I am tuning a routine I have written, and one part of it performs two matrix multiplications that could be done simultaneously. Currently, I call DGEMM (from Intel's MKL library) for the first product and then again for the second. Each call uses all 12 cores on my machine. I want both DGEMM calls to run at the same time, using 6 cores each. I get the feeling that this is a simple matter, but I have not been able to find or understand how to achieve it. The main problem is that each DGEMM call must be issued from a single OpenMP thread, yet be able to use 6 threads. Which directive would work best for this situation? Would it require nested pragmas?

More generally, how can I divide the cores (12 in my case) into sets, each of which runs a routine that is called from one thread but uses all the threads in its set?

Thanks!

Hristo Iliev · Accepted Answer · 2013-01-18T13:55:32

The closest thing that you can do is to have an OpenMP parallel region executing with a team of two threads and then call MKL from each thread. You have to enable nested parallelism in MKL (by disabling dynamic threads), fix the number of MKL threads to 6, and use Intel's compiler suite to compile your code. MKL itself is threaded using OpenMP, but it relies on Intel's OpenMP runtime. If you happen to use another compiler, e.g. GCC, its OpenMP runtime might prove incompatible with Intel's.

As you haven't specified the language, I provide two examples - one in Fortran and one in C/C++:

Fortran:

call mkl_set_num_threads(6)
call mkl_set_dynamic(0)

!$omp parallel sections num_threads(2)
!$omp section
   call dgemm(...)
!$omp section
   call dgemm(...)
!$omp end parallel sections

C/C++:

mkl_set_num_threads(6);
mkl_set_dynamic(0);

#pragma omp parallel sections num_threads(2)
{
    #pragma omp section
    {
        cblas_dgemm(...);
    }
    #pragma omp section
    {
        cblas_dgemm(...);
    }
}
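
For readers who want to try the two-team pattern without MKL, here is a minimal sketch in plain C. It is not the MKL setup above: `naive_gemm` is a hypothetical stand-in for DGEMM, `two_products` and `team_size` are names made up for illustration, and nesting is enabled with the standard `omp_set_max_active_levels` call instead of the MKL service functions. Whether each inner team really gets `team_size` threads is up to the OpenMP runtime.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for DGEMM: C = A * B for n x n row-major matrices.
   The inner parallel region requests team_size threads (must be > 0). */
void naive_gemm(int n, const double *A, const double *B, double *C,
                int team_size)
{
    #pragma omp parallel for num_threads(team_size) schedule(static)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[(size_t)i * n + k] * B[(size_t)k * n + j];
            C[(size_t)i * n + j] = sum;
        }
    }
}

/* Runs two independent products concurrently: an outer team of 2 threads,
   each of which opens its own inner team inside naive_gemm. */
void two_products(int n,
                  const double *A1, const double *B1, double *C1,
                  const double *A2, const double *B2, double *C2,
                  int team_size)
{
#ifdef _OPENMP
    /* Allow two active levels of parallelism (outer sections + inner loop). */
    omp_set_max_active_levels(2);
#endif
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        naive_gemm(n, A1, B1, C1, team_size);

        #pragma omp section
        naive_gemm(n, A2, B2, C2, team_size);
    }
}
```

On a 12-core machine the call would be `two_products(n, A1, B1, C1, A2, B2, C2, 6)`. If the code is compiled without OpenMP support, the pragmas are ignored and the two products simply run one after the other.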

In general you cannot create subsets of threads for MKL (at least given my current understanding). Each DGEMM call would use the globally specified number of MKL threads. Note that MKL operations might be tuned for the cache size of the CPU, so performing two matrix multiplications in parallel might not be beneficial. It might be beneficial if you have a NUMA system with two hexa-core CPUs, each with its own memory controller (which I suspect is your case), but you have to take care of where the data is placed and also enable binding (pinning) of threads to cores.
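
As a rough illustration of the data-placement point, the sketch below lets each of the two outer threads allocate and initialize its own operand. It assumes a first-touch page placement policy (the usual default on Linux), under which memory lands on the NUMA node of the thread that first writes to it. The helper names `alloc_first_touch` and `setup_operands` are hypothetical. This only helps if the two outer threads are pinned to different sockets, for example by setting the standard OpenMP environment variables `OMP_PLACES=cores` and `OMP_PROC_BIND=spread,close`, assuming your runtime supports them.

```c
#include <stdlib.h>

/* Allocates an n x n matrix and fills it with `value`.
   Intended to be called from the thread that will later use the matrix,
   so that (assuming first-touch placement) its pages end up on that
   thread's NUMA node. Returns NULL if the allocation fails. */
double *alloc_first_touch(int n, double value)
{
    size_t count = (size_t)n * (size_t)n;
    double *m = malloc(count * sizeof *m);
    if (m == NULL)
        return NULL;
    for (size_t i = 0; i < count; ++i)
        m[i] = value;  /* the first write decides where the page lives */
    return m;
}

/* Each section allocates and initializes the operand it will work on. */
void setup_operands(int n, double **A1, double **A2)
{
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        *A1 = alloc_first_touch(n, 1.0);

        #pragma omp section
        *A2 = alloc_first_touch(n, 2.0);
    }
}
```

The caller owns the returned matrices and releases them with `free`. The same outer thread that initialized a matrix should also issue the DGEMM call that uses it.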
