I use Intel C++ compiler 17.0.01, and I have two code blocks.
The first code block allocates memory on the Xeon Phi like this:
#pragma offload target(mic:1) nocopy(data[0:size]: alloc_if(1) free_if(0))
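For context, here is a simplified, self-contained sketch of how that block sits in my code (the function name, the size parameter, and the loop body are placeholders, not my real code):

void allocate_on_card(float *data, int size)
{
    // alloc_if(1) allocates data[0:size] on the card, free_if(0) keeps that
    // allocation alive for later offload regions, and nocopy skips the
    // host -> card transfer.
    #pragma offload target(mic:1) nocopy(data[0:size] : alloc_if(1) free_if(0))
    {
        for (int i = 0; i < size; ++i)
            data[i] = 0.0f;   // touch the freshly allocated card-side buffer
    }
}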
The second block evaluates the above memory and copies it back to the host:
#pragma offload target(mic:1) out(data[0:size]: alloc_if(0) free_if(0))
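And a matching sketch for the second block (again with placeholder names and a dummy computation):

void fill_and_copy_back(float *data, int size)
{
    // alloc_if(0) reuses the buffer created by the previous offload,
    // free_if(0) keeps it on the card, and out(...) copies card -> host
    // when the region finishes.
    #pragma offload target(mic:1) out(data[0:size] : alloc_if(0) free_if(0))
    {
        for (int i = 0; i < size; ++i)
            data[i] = i * 0.5f;   // placeholder for the real evaluation
    }
}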
This code runs just fine, but #pragma offload is specific to Intel's compiler (I think), so I want to convert it to OpenMP.
This is how I translated the first block to OpenMP:
#pragma omp target device(1) map(alloc:data[0:size])
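In the same simplified sketch, the OpenMP version looks like this (placeholder names; I am not sure this mapping keeps the device buffer alive after the region the way free_if(0) does):

void allocate_on_card_omp(float *data, int size)
{
    // map(alloc: ...) creates the device buffer without transferring the
    // host contents, which is my intended equivalent of nocopy + alloc_if(1).
    #pragma omp target device(1) map(alloc: data[0:size])
    {
        for (int i = 0; i < size; ++i)
            data[i] = 0.0f;   // placeholder initialisation on the device
    }
}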
And this is how I translated the second block to OpenMP:
#pragma omp target device(1) map(from:data[0:size])
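And the corresponding sketch for the copy-back block (same placeholder computation as above):

void fill_and_copy_back_omp(float *data, int size)
{
    // map(from: ...) allocates the buffer on the device at region entry
    // (uninitialised) and copies it back to the host at region exit.
    #pragma omp target device(1) map(from: data[0:size])
    {
        for (int i = 0; i < size; ++i)
            data[i] = i * 0.5f;   // placeholder for the real evaluation
    }
}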
Also, I used export OFFLOAD_REPORT=2 to get a better idea of what is going on at runtime.
Here are my problems/questions:
- The OpenMP version of the first code block is as fast as the Intel version (#pragma offload). Nothing strange here.
- The OpenMP version of the second code block is 5 times slower than the Intel version. However, the MIC_TIME of the two is the same, but the CPU_TIME is different (much higher for the OpenMP version). Why is that?
- Are my Intel directives optimal?
- Is my Intel -> OpenMP translation correct and optimal?
And here are some other, slightly different questions:
- On the testing machine I have two Intel Phi cards. Since I want to use the 2nd one, I do this: #pragma omp target device(1)... Is that correct?
- If I do #pragma omp target device(5)... the code still works! It seems to run on one of the Phi cards (and not the CPU), because the performance is similar. Why is that?
- I also tried my software (the OpenMP version) on a machine without a Xeon Phi, and it ran just fine on the CPU! Is this guaranteed? When there is no accelerator on the machine, is the target device(1) simply ignored?
- Is it possible to do something like std::cout << print_phi_card_name_or_uid(); inside an OpenMP offloaded region, so I know for sure on which card my software is running? (A sketch of the kind of check I mean follows this list.)
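To illustrate that last question, this is roughly the check I have in mind. I do not know whether it is the right approach; omp_is_initial_device() and omp_get_num_devices() are just the standard OpenMP 4.0 API calls I found, and they only tell me host vs. device, not which card:

#include <cstdio>
#include <omp.h>

int main()
{
    printf("devices visible to the runtime: %d\n", omp_get_num_devices());

    int on_host = 1;
    #pragma omp target device(1) map(from: on_host)
    {
        // 1 if the region fell back to the host CPU, 0 if it ran on a device
        on_host = omp_is_initial_device();
    }
    printf("offload region ran on %s\n", on_host ? "the host CPU" : "a device");
    return 0;
}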