
[Background: OpenMP v4+ on Intel's icc compiler]

I want to parallelize tasks inside a loop that is already parallelized. I have seen quite a few questions on subjects close to this one, but I could not get a definite answer from them; when I try it myself, all I get is a compile-time error.

Code:

    #pragma omp parallel for private(a, bd, u_n) reduction(+:sum)
    for (int i=0; i<128; i++) {
        a = i%2;
        for (int j=a; j<128; j=j+2) {
             u_n = 0.25 * ( u[ i*128 + (j-3) ]+
                            u[ i*128 + (j+3) ]+
                            u[ (i-1)*128 + j ]+
                            u[ (i+1)*128 + j ]);
          // #pragma omp single nowait 
          // {
          // #pragma omp task shared(sum1) firstprivate(i,j)
          // sum1 = (u[i*128+(j-3)]+u[i*128+(j-2)] + u[i*128+(j-1)])/3;
          // #pragma omp task shared(sum2) firstprivate(i,j)
          // sum2 = (u[i*128+(j+3)]+u[i*128+(j+2)]+u[i*128+(j+1)])/3; 
          // #pragma omp task shared(sum3) firstprivate(i,j)
          // sum3 = (u[(i-1)*128+j]+u[(i-2)*128+j]+u[(i-3)*128+j])/3;
          // #pragma omp task shared(sum4) firstprivate(i,j)
          // sum4 = (u[(i+1)*128+j]+u[(i+2)*128+j]+u[(i+3)*128+j])/3;
          // }
          // #pragma omp taskwait 
          // {
          // u_n = 0.25*(sum1+sum2+sum3+sum4);
          // }
             bd = u_n - u[i*128+ j];
             sum += bd * bd;
             u[i*128+j]=u_n;
       }    
  }

In the above code, I tried replacing the u_n = 0.25 * (...); line with the 15 commented lines, to try not only to parallelize the iterations over the two for loops, but also to achieve a degree of parallelism within each of the four calculations (sum1 to sum4) involving the array u[].

The compile error is fairly explicit:

error: the OpenMP "single" pragma must not be enclosed by the "parallel for" pragma

Is there a way around this so I can optimize that calculation further with OpenMP?


1 Answer


The single worksharing construct within a loop worksharing construct is prohibited by the standard, but you don't need it there.

The usual parallel -> single -> task pattern for tasking ensures that you have a thread team set up for your tasks (parallel) while spawning each task only once (single). You don't need the latter in a parallel for context, because each iteration is already executed only once, so you can spawn tasks directly within the loop body. This seems to have the expected behavior on both the GNU and Intel compilers, i.e. threads that have completed their own loop iterations do help other threads execute their tasks.
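To make that concrete, here is a minimal sketch of what spawning tasks directly from the loop body could look like, based on the commented-out code in your question. It assumes the same 128x128 array u and the outer declarations of a, bd, u_n and sum; the loop bounds are narrowed here (my assumption, not part of your code) so that the 3-point stencils stay inside the grid.

    #pragma omp parallel for private(a, bd, u_n) reduction(+:sum)
    for (int i = 3; i < 125; i++) {
        a = i % 2;
        for (int j = a + 4; j < 124; j += 2) {
            double sum1, sum2, sum3, sum4;
            /* Tasks are spawned directly from the loop body; no single construct
               is needed, since each iteration is executed by exactly one thread. */
    #pragma omp task shared(sum1) firstprivate(i, j)
            sum1 = (u[i*128 + (j-3)] + u[i*128 + (j-2)] + u[i*128 + (j-1)]) / 3;
    #pragma omp task shared(sum2) firstprivate(i, j)
            sum2 = (u[i*128 + (j+3)] + u[i*128 + (j+2)] + u[i*128 + (j+1)]) / 3;
    #pragma omp task shared(sum3) firstprivate(i, j)
            sum3 = (u[(i-1)*128 + j] + u[(i-2)*128 + j] + u[(i-3)*128 + j]) / 3;
    #pragma omp task shared(sum4) firstprivate(i, j)
            sum4 = (u[(i+1)*128 + j] + u[(i+2)*128 + j] + u[(i+3)*128 + j]) / 3;
            /* Wait for all four tasks before combining their results. */
    #pragma omp taskwait
            u_n = 0.25 * (sum1 + sum2 + sum3 + sum4);
            bd = u_n - u[i*128 + j];
            sum += bd * bd;
            u[i*128 + j] = u_n;
        }
    }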

However, that would be a bad idea in your case: a tiny computation such as the one for sum1 will be much faster on its own than the overhead of spawning a task for it.

Removing all pragmas except for the parallel for, this is a very reasonable parallelization. Before optimizing the calculation further, you should measure! In particular, you want to know whether all of your available threads are computing something at all times, or whether some threads finish early and wait for the others (load imbalance). To measure, look for a parallel performance analysis tool for your platform. If load imbalance turns out to be the issue, you can address it with scheduling policies, or possibly with nested parallelism in the inner loop.
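If it does turn out that some threads wait while others are still busy, the cheapest knob to try is the loop schedule. The following is only a sketch of that idea, not something derived from measurements of your code, and the chunk size of 4 is an arbitrary illustrative value you would tune.

    /* Dynamic scheduling hands out chunks of the i loop to whichever thread is
       idle, which helps when iterations have uneven cost. */
    #pragma omp parallel for schedule(dynamic, 4) private(a, bd, u_n) reduction(+:sum)
    for (int i = 0; i < 128; i++) {
        a = i % 2;
        for (int j = a; j < 128; j += 2) {
            /* ... same body as in your original loop ... */
        }
    }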

A full discussion of the performance of your code is more complex, and requires a minimal, complete and verifiable example, a detailed system description, and actual measured performance numbers.