Special case of collapse OpenMP

Question

I have a potentially simple question, but looking at SO I couldn't find any questions that asked quite the same thing. My question is: will the collapse clause in the OpenMP code below properly handle both inner loops? Or will it only collapse with the first inner loop?

!$omp parallel do collapse(2) private(iy, ix, iz)
do iy = 1, ny
    do ix = 1, nx
       ! stuff
    enddo
    do iz = 1, nz
       ! different stuff
    enddo
enddo
!$omp end parallel do

This code compiles for me and obviously shows benefits of parallelization. However, I know that the standard says:

All loops associated with the loop construct must be perfectly nested; that is, there must be no intervening code nor any OpenMP directive between any two loops.

So my gut reaction is that OpenMP is only collapsing the first inner loop (ix). But then how is it handling the second inner loop (iz)?

What I want the code to do is the following, but writing it this way is uglier and more verbose:

!$omp parallel private(iy, ix, iz)
!$omp do collapse(2)
do iy = 1, ny
    do ix = 1, nx
       ! stuff
    enddo
enddo
!$omp end do nowait

!$omp do collapse(2)
do iy = 1, ny
    do iz = 1, nz
       ! different stuff
    enddo
enddo
!$omp end do nowait
!$omp end parallel

nowait on the second do does nothing since parallel end has barrier semantics. — Jeff Hammond

Davislor · Accepted Answer · 2015-09-11T23:13:16

The first inner loop is code intervening between the outer loop and the second inner loop (as I understand it). If nz≠nx, you don’t have rectangular loops. In any case, the program semantics are that the first inner loop must complete before the second inner loop begins; it might perform intermediate calculations that the second loop uses. A given implementation of OpenMP might do what you want—I haven’t attempted to test this.

Note that the second example changes the semantics of the program: all the ix loops execute, followed by all the iz loops, rather than each ix loop followed by each iz loop for the same value of iy. This should be safe if you could parallelize the ix loop, as you can only do that if none of the ix computations depend on any iz computation, but might not be as efficient if the iz loops are going to re-use the same data. So the correct semantics are going to depend on what needs to happen before a given loop can run. Do the iz loops need the ix loops to have run first for the same value of iy? If not, you might be able to use nested parallelism.
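To make the difference in execution order concrete, here is a minimal sketch in C rather than the question's Fortran. The array names (a, rowsum, b), the sizes, and the loop bodies are invented for illustration. It assumes the iz loop only reads what the ix loop produced for the same iy. The fused version shares only the outer loop among threads and uses no collapse clause; the split version runs every ix loop before any iz loop.

```c
#define NY 4
#define NX 3
#define NZ 5

/* Fused order: for each iy, the ix loop runs and then the iz loop runs.
   Only the outer loop is divided among threads (no collapse clause).
   Compiled without OpenMP support, the pragma is ignored and this runs serially. */
void fused(double a[NY][NX], double rowsum[NY], double b[NY][NZ])
{
    #pragma omp parallel for
    for (int iy = 0; iy < NY; ++iy) {
        double s = 0.0;
        for (int ix = 0; ix < NX; ++ix)      /* stands in for "stuff" */
            s += a[iy][ix];
        rowsum[iy] = s;
        for (int iz = 0; iz < NZ; ++iz)      /* stands in for "different stuff" */
            b[iy][iz] = rowsum[iy] * iz;
    }
}

/* Split order: every ix loop finishes before any iz loop starts.
   Shown serially; the intermediate result must live in an array (rowsum)
   so that it survives until the second loop nest. */
void split(double a[NY][NX], double rowsum[NY], double b[NY][NZ])
{
    for (int iy = 0; iy < NY; ++iy) {
        double s = 0.0;
        for (int ix = 0; ix < NX; ++ix)
            s += a[iy][ix];
        rowsum[iy] = s;
    }
    for (int iy = 0; iy < NY; ++iy)
        for (int iz = 0; iz < NZ; ++iz)
            b[iy][iz] = rowsum[iy] * iz;
}
```

Under that assumption both orders fill b with the same values. If the real iz loop depended on ix results from a different iy, the fused version would no longer be safe to parallelize over iy.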

Note on Loop Collapsing: Loop collapsing usually means you take a nested pair of loops, such as,

for (i=0;i<100;++i)
  for (j=0;j<50;++j)

And turn them into a single loop like:

for (ij=0;ij<5000;++ij)

If you have two different inner loops with different indices, you cannot do this, and furthermore, the compiler can’t automatically change the order of execution as proposed because that changes program semantics. I’m not sure what every OpenMP implementation does with this code, but I’m pretty sure that it doesn’t work the way you were hoping.
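The collapsing described above can be written out by hand for the rectangular case. This is an illustrative sketch, not what any particular OpenMP implementation generates; the function names and the i*1000 + j payload are made up so that the two traversals can be compared.

```c
#include <assert.h>

/* Collapsed form: one loop over ni*nj iterations, with the original
   indices recovered by division and remainder.
   out must have room for ni*nj entries. */
void collapsed_visit(int ni, int nj, int *out)
{
    assert(ni > 0 && nj > 0);
    for (int ij = 0; ij < ni * nj; ++ij) {
        int i = ij / nj;   /* outer (slow) index */
        int j = ij % nj;   /* inner (fast) index */
        out[ij] = i * 1000 + j;
    }
}

/* Reference form: the original pair of perfectly nested loops. */
void nested_visit(int ni, int nj, int *out)
{
    int k = 0;
    for (int i = 0; i < ni; ++i)
        for (int j = 0; j < nj; ++j)
            out[k++] = i * 1000 + j;
}
```

The index recovery only works because every value of i has the same number of j iterations. With two inner loops of different lengths under the same outer loop, there is no single (i, j) grid to flatten.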
