I am testing a node with three Intel Xeon Phi cards. My idea is to use OpenMP 4.0 directives to offload work to the coprocessors. The code is as follows (taken from http://goo.gl/9Ztq0e):
/***************************************************************************************************
* FILE : openmp4x-reduce-1Darray.c
*
* INPUT : Nil
*
* OUTPUT : Displays Host and device reduce sum
*
* CREATED : August,2013
*
* EMAIL : hpcfte@cdac.in
*
***************************************************************************************************/
#include <stdio.h>

#define SIZE 10000

#pragma omp declare target
/* Sum the SIZE elements of inarray inside an offloaded target region. */
int reduce(int *inarray)
{
    int sum = 0;
    #pragma omp target map(inarray[0:SIZE]) map(sum)
    {
        for (int i = 0; i < SIZE; i++)
            sum += inarray[i];
    }
    return sum;
}
#pragma omp end declare target

int main()
{
    int inarray[SIZE], sum, validSum;

    /* Fill the input and compute the reference sum on the host. */
    validSum = 0;
    for (int i = 0; i < SIZE; i++) {
        inarray[i] = i;
        validSum += i;
    }

    sum = 0;
    sum = reduce(inarray);
    printf("sum reduction = %d,validSum=%d\n", sum, validSum);
}
I compiled it with the Intel compiler, version intel/16.0.1.150 (I read on Intel's site that this version supports OpenMP 4.0, but I may be wrong). I set the following environment variables and compiled with the icc command below:
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=240
export MIC_KMP_AFFINITY=granularity=fine,compact
icc -openmp -std=c99 -qopt-report2 openmp_4.0_reduce_1Darray.c -o exec
When I run the code, I use micsmc-gui (the graphical interface) to watch the core utilization on the coprocessors. What I don't understand is why only one core seems to be used on each coprocessor, regardless of the number of threads I set for the MIC (see the red rectangle on each MIC in the figure).
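One thing I have not ruled out: the loop in reduce() sits in a bare target region with no work-sharing directive, and a target region starts with a single thread on the device, so MIC_OMP_NUM_THREADS may have nothing to distribute. Below is a minimal sketch of the variant I could test. It is untested on my node; the name reduce_parallel and the explicit map types are my own assumptions, not part of the original file.

```c
#include <stdio.h>

#define SIZE 10000

/* Hypothetical variant of reduce(): the same offload, but the loop is
   shared among the device threads and the partial sums are combined
   with a reduction clause. */
int reduce_parallel(int *inarray)
{
    int sum = 0;
    /* Copy the array to the device; copy sum both ways. */
    #pragma omp target map(to: inarray[0:SIZE]) map(tofrom: sum)
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < SIZE; i++)
        sum += inarray[i];
    return sum;
}
```

Calling reduce_parallel(inarray) in place of reduce(inarray) in main should give the same sum. If the compiler ignores the OpenMP pragmas, the loop simply runs serially on the host.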
Any suggestions?
Thanks.