
I am testing a node with three Intel Xeon Phi cards. My idea is to use OpenMP 4.0 directives to offload work to the coprocessors. The code is as follows (taken from http://goo.gl/9Ztq0e):

/***************************************************************************************************
* FILE          : openmp4x-reduce-1Darray.c
*
* INPUT         : Nil
*
* OUTPUT        : Displays Host and device reduce sum
*
* CREATED       : August,2013
*
* EMAIL         : hpcfte@cdac.in
*
***************************************************************************************************/

#include <stdio.h>

 #define SIZE 10000
 #pragma omp declare target

int reduce(int *inarray)
{

  int sum = 0;
  #pragma omp target map(inarray[0:SIZE]) map(sum)
  {
    for(int i=0;i<SIZE;i++)
      sum += inarray[i];
  }
  return sum;
}

int main()
{
  int inarray[SIZE], sum, validSum;

  validSum=0;
  for(int i=0; i<SIZE; i++){
    inarray[i]=i;
    validSum+=i;
  }

 sum=0;
 sum = reduce(inarray);

 printf("sum reduction = %d,validSum=%d\n",sum, validSum);
}

I compiled it with the Intel compiler, version intel/16.0.1.150 (I read on Intel's site that this compiler supports OpenMP 4.0, but I may be wrong). I set the following environment variables and used this compile command:

export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=240
export MIC_KMP_AFFINITY=granularity=fine,compact

icc -openmp -std=c99 -qopt-report2 openmp_4.0_reduce_1Darray.c -o exec

The issue appears when I run the code and use micsmc-gui (the graphical interface) to watch the core utilization on the coprocessors. What I don't understand is why only one core seems to be used on each coprocessor, regardless of the number of threads I request on the MIC; see the red rectangle on each MIC in the figure.

[Figure: performance of MICs]

Any suggestion?

Thanks.

That is to be expected. There are no worksharing constructs in your code, so it is running serially on the co-processor. - Hristo Iliev

1 Answer


You have not specified any parallel directive, so the loop runs sequentially on a single thread. Add an OpenMP parallel for directive (with a reduction clause on sum) to distribute the loop iterations across the multiple cores of the MIC:

    int reduce(int *inarray)
    {

      int sum = 0;
      #pragma omp target map(inarray[0:SIZE]) map(sum)
      {
        #pragma omp parallel for reduction(+:sum)
        for(int i=0;i<SIZE;i++)
          sum += inarray[i];
      }
      return sum;
    }

Some basic documentation: https://computing.llnl.gov/tutorials/openMP/#DO
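
To check from the program itself that the offloaded loop runs with more than one thread, a minimal sketch along the following lines may help. The names reduce_parallel and threads_used are hypothetical and not part of the original code, and the explicit map(to: ...) and map(tofrom: ...) clauses are my own choice to make the transfer direction visible. It assumes the compiler accepts OpenMP runtime calls inside a target region, which OpenMP 4.0 allows.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: sums n ints inside a target region and reports
 * how many threads the parallel region used via *threads_used.
 * Without OpenMP the pragmas are ignored and the loop runs serially. */
int reduce_parallel(const int *inarray, int n, int *threads_used)
{
    int sum = 0;
    int nthreads = 1;   /* value reported when OpenMP is not enabled */

    assert(inarray && threads_used && n >= 0);

    /* Copy the input to the device; copy sum and nthreads both ways. */
    #pragma omp target map(to: inarray[0:n]) map(tofrom: sum, nthreads)
    {
        #pragma omp parallel
        {
            /* One thread records the team size. */
            #pragma omp single
            {
#ifdef _OPENMP
                nthreads = omp_get_num_threads();
#endif
            }

            /* Worksharing loop: iterations are split across the team,
             * and the partial sums are combined by the reduction. */
            #pragma omp for reduction(+:sum)
            for (int i = 0; i < n; i++)
                sum += inarray[i];
        }
    }

    *threads_used = nthreads;
    return sum;
}
```

Calling reduce_parallel(inarray, SIZE, &threads) from the question's main and printing threads gives a more direct check than the GUI. With only 10000 iterations the offloaded region finishes very quickly, so utilization may still be hard to see in micsmc-gui even when many threads are used.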