[Background: OpenMP v4+ on Intel's icc compiler]
I want to parallelize tasks inside a loop that is already parallelized. I saw quite a few questions on subjects close to this one, e.g.:
- Parallel sections in OpenMP using a loop
- Doing a section with one thread and a for-loop with multiple threads
- and others with more concentrated wisdom still.
but I could not find a definite answer, only a compile-time error when I tried it myself.
Code:
#pragma omp parallel for private(a,bd) reduction(+:sum)
for (int i=0; i<128; i++) {
    a = i%2;
    for (int j=a; j<128; j=j+2) {
        u_n = 0.25 * ( u[ i*128 + (j-3) ] +
                       u[ i*128 + (j+3) ] +
                       u[ (i-1)*128 + j ] +
                       u[ (i+1)*128 + j ] );
        // #pragma omp single nowait
        // {
        //     #pragma omp task shared(sum1) firstprivate(i,j)
        //     sum1 = (u[i*128+(j-3)] + u[i*128+(j-2)] + u[i*128+(j-1)])/3;
        //     #pragma omp task shared(sum2) firstprivate(i,j)
        //     sum2 = (u[i*128+(j+3)] + u[i*128+(j+2)] + u[i*128+(j+1)])/3;
        //     #pragma omp task shared(sum3) firstprivate(i,j)
        //     sum3 = (u[(i-1)*128+j] + u[(i-2)*128+j] + u[(i-3)*128+j])/3;
        //     #pragma omp task shared(sum4) firstprivate(i,j)
        //     sum4 = (u[(i+1)*128+j] + u[(i+2)*128+j] + u[(i+3)*128+j])/3;
        // }
        // #pragma omp taskwait
        // {
        //     u_n = 0.25*(sum1+sum2+sum3+sum4);
        // }
        bd = u_n - u[i*128 + j];
        sum += bd * bd;
        u[i*128 + j] = u_n;
    }
}
In the code above, I tried replacing the u_n = 0.25 * (...); line with the 15 commented lines, so as not only to parallelize the iterations over the two for loops, but also to achieve a degree of parallelism in each of the 4 calculations (sum1 to sum4) involving the array u[].
The compile error is fairly explicit:
error: the OpenMP "single" pragma must not be enclosed by the "parallel for" pragma
Is there a way around this so that I can optimize the calculation further with OpenMP?