I am a noob in OpenACC and I am trying to optimize some code. For the CPU I have:

Time = Time + omp_get_wtime();
    {
      #pragma acc parallel loop
      for (int i = 1;i < k-1; i++)
      {
        jcount[i]=((int)(MLT[i]/dt))+1;
      }
      jcount[0]=0;
      jcount[k-1]=N;

      #pragma acc parallel loop collapse(2)
      for (int i = 0;i < k - 1; i++)
      {
        for(int j=jcount[i];j < jcount[i+1];j++)
        {
          w[j] = (j*dt - MLT[i])/(MLT[i+1]-MLT[i]);
          X[j] = MLX[i]*(1-w[j])+MLX[i+1]*w[j];
          Y[j] = MLY[i]*(1-w[j])+MLY[i+1]*w[j];
        }
      }
    }
Time = omp_get_wtime() - Time;

For my Intel i7 with 6 cores (Hyper-Threading turned off) I get poor parallelization: the difference between 6 cores and 1 core is only 30% (which would mean that about 70% of the code runs sequentially, but I don't see where).

For GPU:

...
    acc_init( acc_device_nvidia );
...
TimeGPU = TimeGPU + omp_get_wtime();
    {
      #pragma acc kernels loop independent  copyout(jcount[0:k]) copyin(MLT[0:k],dt)
      for (int i = 1;i < k-1; i++)
      {
        jcount[i]=((int)(MLT[i]/dt))+1;
      }
      jcount[0]=0;
      jcount[k-1]=N;

      #pragma acc kernels loop independent copyout(X[0:N+1],Y[0:N+1]) copyin(MLT[0:k],MLX[0:k],MLY[0:k],dt) copy(w[0:N])
      for (int i = 0;i < k - 1; i++)
      {
        for(int j=jcount[i];j < jcount[i+1];j++)
        {
          w[j] = (j*dt - MLT[i])/(MLT[i+1]-MLT[i]);
          X[j] = MLX[i]*(1-w[j])+MLX[i+1]*w[j];
          Y[j] = MLY[i]*(1-w[j])+MLY[i+1]*w[j];
        }
      }
    }
TimeGPU = omp_get_wtime() - TimeGPU;

And the GPU (GTX 1070) is 3 times slower than the 6-core CPU!

Launch parameters:
GPU: pgc++ -ta=tesla:cuda9.0 -Minfo=accel -O4
CPU: pgc++ -ta=multicore -Minfo=accel -O4

k = 20000, N = 2 million

UPDATE:

I changed the GPU code:

TimeGPU = TimeGPU + omp_get_wtime();
#pragma acc data create(jcount[0:k],w[0:N]) copyout(X[0:N+1],Y[0:N+1]) copyin(MLT[0:k],MLX[0:k],MLY[0:k],dt)
    {
      #pragma acc parallel loop
      for (int i = 1;i < k-1; i++)
      {
        jcount[i]=((int)(MLT[i]/dt))+1;
      }
      jcount[0]=0;
      jcount[k-1]=N;

      #pragma acc parallel loop
      for (int i = 0;i < k - 1; i++)
      {
        for(int j=jcount[i];j < jcount[i+1];j++)
        {
          w[j] = (j*dt - MLT[i])/(MLT[i+1]-MLT[i]);
          X[j] = MLX[i]*(1-w[j])+MLX[i+1]*w[j];
          Y[j] = MLY[i]*(1-w[j])+MLY[i+1]*w[j];
        }
      }
    }
TimeGPU = omp_get_wtime() - TimeGPU;

Launch parameters:
pgc++ -ta=tesla:managed:cuda9.0 -Minfo=accel -O4

Now the GPU is 2 times slower than the CPU.

Output:

139: compute region reached 1 time
        139: kernel launched 1 time
            grid: [157]  block: [128]
             device time(us): total=425 max=425 min=425 avg=425
            elapsed time(us): total=509 max=509 min=509 avg=509
    139: data region reached 2 times
        139: data copyin transfers: 1
             device time(us): total=13 max=13 min=13 avg=13
    146: compute region reached 1 time
        146: kernel launched 1 time
            grid: [157]  block: [128]
             device time(us): total=13,173 max=13,173 min=13,173 avg=13,173
            elapsed time(us): total=13,212 max=13,212 min=13,212 avg=13,212

Why is my measured TimeGPU about 2 times bigger than the time reported with PGI_ACC_TIME=1? (30 ms vs 14 ms)

1 Answer


I'm thinking a lot of the GPU time is due to poor memory access in your kernels. Ideally you want the vector lanes to access contiguous data.

How many iterations does the "j" loop have? If longer than 32, then you might try adding a "#pragma acc loop vector" on it so it gets parallelized across the vector lanes and gives you better data access.
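
For example, a sketch along these lines (untested, keeping the variable names from your code) puts the "i" loop on the gangs and the "j" loop on the vector lanes:

#pragma acc parallel loop gang
for (int i = 0; i < k - 1; i++)
{
    // each gang handles one "i"; the contiguous "j" range is spread across the vector lanes
    #pragma acc loop vector
    for (int j = jcount[i]; j < jcount[i+1]; j++)
    {
        w[j] = (j*dt - MLT[i])/(MLT[i+1]-MLT[i]);
        X[j] = MLX[i]*(1-w[j])+MLX[i+1]*w[j];
        Y[j] = MLY[i]*(1-w[j])+MLY[i+1]*w[j];
    }
}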

Also, you have a lot of redundant memory fetches. Consider loading the values from the arrays indexed by "i" into temporary variables so they are fetched from memory only once per outer iteration.
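
Something like this (again untested, and assuming the arrays are double, since the declarations aren't shown) fetches the "i"-indexed values once per outer iteration instead of on every "j" iteration:

#pragma acc parallel loop gang
for (int i = 0; i < k - 1; i++)
{
    // hoist the "i"-indexed loads so each gang reads them only once
    double mlt0 = MLT[i], mlt1 = MLT[i+1];
    double mlx0 = MLX[i], mlx1 = MLX[i+1];
    double mly0 = MLY[i], mly1 = MLY[i+1];
    double inv  = 1.0/(mlt1 - mlt0);
    #pragma acc loop vector
    for (int j = jcount[i]; j < jcount[i+1]; j++)
    {
        double wj = (j*dt - mlt0)*inv;
        w[j] = wj;
        X[j] = mlx0*(1-wj) + mlx1*wj;
        Y[j] = mly0*(1-wj) + mly1*wj;
    }
}

This also gives the compiler a chance to keep the hoisted values in registers instead of re-reading global memory inside the inner loop.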