I am working on code-tuning a routine I have written and one part of it performs two matrix multiplications which could be done simultaneously. Currently, I call DGEMM (from Intel's MKL library) on the first DGEMM and then the second. This uses all 12 cores on my machine per call. I want it to perform both DGEMM routines at the same time, using 6 cores each. I get the feeling that this is a simple matter, but have not been able to find/understand how to achieve this. The main problem I have is that OpenMP must call the DGEMM routine from one thread, but be able to use 6 for each call. Which directive would work best for this situation? Would it require nested pragmas?
So as a more general note, how can I divide the (in my case 12) cores into sets which then run a routine from one thread which uses all threads in its set.
Thanks!