0 votes

I have these uninformative nested loops (just as a performance test):

const int N = 300;

for (int num = 0; num < 10000; num++) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}

The elapsed time was

about 6 s

I tried to parallelize different loops with OpenMP, but I am very confused by the results I got.

In the first step, I used the "parallel for" pragma only on the first (outermost) loop:

#pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
for (int num = 0; num < 10000; num++) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}

The elapsed time was (2 cores)

3.81 s

Then I tried to parallelize the two inner loops with the "collapse" clause (2 cores):

for (int num = 0; num < 10000; num++) {
    #pragma omp parallel for collapse(2) schedule(static) reduction(+:sum1,sum2)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}

The elapsed time was

3.76 s

This is faster than in the previous case, and I do not understand why.

If I fuse these inner loops manually (which is supposed to be better for performance), like this

#pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
for (int n = 0; n < N * N; n++) {
    int i = n / N;
    int j = n % N;
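
In full, the fused version looks like this (with the pragma inside the outer num loop, like the collapse variant above; the loop body is unchanged):

for (int num = 0; num < 10000; num++) {
    #pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
    for (int n = 0; n < N * N; n++) {
        int i = n / N;
        int j = n % N;
        arr[i][j] = brr[i][j];
        crr[i][j] = arr[i][j] - brr[i][j];
        sum1 += crr[i][j];
        sum2 += arr[i][j];
    }
}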

the elapsed time is

5.53 s

This confuses me a lot. The performance is worse in this case, although loop fusion is usually recommended for better performance.

Okay, now let's try to parallelize only the middle loop, like this (2 cores):

for (int num = 0; num < 10000; num++) {
    #pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}

Again, the performance becomes better:

3.703 s

And the final step: parallelizing only the innermost loop, assuming this will be the fastest case according to the previous results (2 cores):

for (int num = 0; num < 10000; num++) {
    for (int i = 0; i < N; i++) {
        #pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}

But (surprise!) the elapsed time is

about 11 s

This is much slower than in the previous cases. I cannot figure out the reason for any of this.

By the way, while looking for similar questions I found the advice to add

#pragma omp parallel

before the first loop (for example, in this and that question). But why is that the right approach? If we place

#pragma omp parallel

before a for loop, it means that each thread executes the whole for loop, which is incorrect (excess work). Indeed, I tried inserting

#pragma omp parallel

before the outermost loop combined with the various placements of

#pragma omp parallel for

described above, and the performance was worse in all cases (moreover, in the last case, when parallelizing only the innermost loop, the answer was also incorrect: "sum2" differed, because there was a race condition).
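
For illustration, one of these combinations, with the innermost loop parallelized inside an enclosing parallel region, looked roughly like this:

#pragma omp parallel
for (int num = 0; num < 10000; num++) {
    for (int i = 0; i < N; i++) {
        // sum1 and sum2 are shared in the enclosing parallel region, so the
        // write-back of each inner reduction races with the other threads;
        // in addition, every thread repeats the whole computation (excess work)
        #pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}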

I would like to understand the reasons for this behavior (probably the cost of data exchange is greater than the actual computation done by each thread, but that applies only to the last case) and to know which solution is the most correct one.

EDIT: I've disabled compiler optimizations (with the -O0 option) and the results are still the same (except that the elapsed time in the last example, when parallelizing the innermost loop, dropped from 11 s to 8 s). Compiler options:

g++ -std=gnu++0x -fopenmp -O0 test.cpp

Definition of variables:

unsigned int seed;
const int N = 300;

int main()
{
    double arr[N][N];
    double brr[N][N];
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = i * j;
            brr[i][j] = i + j;
        }
    }

    double start = omp_get_wtime();
    double crr[N][N];
    double sum1 = 0;
    double sum2 = 0;
    // ... the loop nests shown above follow here
It is surprisingly difficult to write code just to test performance. The arr and crr assignments don't change anything after the first iteration, so the compiler might optimize them away; if they did change something, the outer-loop parallelization would be incorrect. Discussing these particular loops won't help much in understanding loops where these arrays actually change. Also, please include a minimal reproducible example. In particular, the compiler options, compiler version, and variable definitions are missing, which matters a whole lot. – Zulan

@Zulan I've disabled compiler optimization (see the edited question) and added the variable definitions. – newt

2 Answers

0 votes

And the final step: parallelizing only the innermost loop, assuming this will be the fastest case according to the previous results (2 cores)

But (surprise!) the elapsed time is:

about 11 s

It is not a surprise at all. Parallel regions perform implicit barriers and may even join and create threads (some implementations use thread pools to reduce the cost of thread creation).
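
To get a feeling for how expensive this is on a given machine, a small stand-alone micro-benchmark (my own sketch, not part of the original timing runs) can compare re-opening a parallel region on every iteration with opening a single region and only paying for barriers:

#include <cstdio>
#include <omp.h>

int main()
{
    const int reps = 10000;

    // Open (and implicitly close) a parallel region on every iteration:
    // this pays the fork/join cost reps times.
    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; r++) {
        #pragma omp parallel
        { }
    }
    double t1 = omp_get_wtime();

    // Open a single parallel region and only synchronize with barriers.
    double t2 = omp_get_wtime();
    #pragma omp parallel
    {
        for (int r = 0; r < reps; r++) {
            #pragma omp barrier
        }
    }
    double t3 = omp_get_wtime();

    printf("region per iteration: %.3f s, single region with barriers: %.3f s\n",
           t1 - t0, t3 - t2);
    return 0;
}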

In the end, opening parallel regions is expensive, so you should do it as few times as possible. With a single enclosing region, the threads run the outer loop at the same time, but they divide the iteration space among themselves once they reach the omp for construct, so the result should still be correct (you should make your program check this if you are unsure).
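
A minimal check, sketched here with my own names ref1/ref2 (and assuming <cstdio> is included), is to redo the computation serially and compare:

double ref1 = 0.0, ref2 = 0.0;
for (int num = 0; num < 10000; num++) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double a = brr[i][j];   // value of arr[i][j] after the copy
            ref1 += a - brr[i][j];  // value of crr[i][j]
            ref2 += a;
        }
    }
}
// Exact comparison is fine here: all summands are small integers stored in
// doubles, so the reduction order does not change the result.
if (ref1 != sum1 || ref2 != sum2)
    printf("mismatch: sum1=%g (expected %g), sum2=%g (expected %g)\n",
           sum1, ref1, sum2, ref2);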

For performance testing, you should always run your experiments with compiler optimizations turned on, as they have a heavy impact on the behavior of the application (you should not draw conclusions about performance from unoptimized programs, because their problems may already be addressed by optimization).
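
For example, the command from the question only needs an optimization level instead of -O0 (the exact level is a matter of choice):

g++ -std=gnu++0x -fopenmp -O2 test.cpp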

Making a single parallel region that contains all the loops halves the execution time in my setup (it started at 9.536 s with 2 threads and dropped to 4.757 s).

The omp for construct still applies an implicit barrier at its end, which is not needed in your example. Adding the nowait clause roughly halves the execution time again: 2.120 s.

From this point, you can now try to explore the other options.

Parallelizing the middle loop reduces the execution time to only 0.732 s, due to much better usage of the memory hierarchy and to vectorization: the L1 miss ratio drops from ~29% to ~0.3%.

Using collapse on the two innermost loops made no significant difference with two threads (strong scaling should be checked).

Using other directives such as omp simd does not improve performance in this case, as the compiler is confident enough that it can vectorize the innermost loop safely on its own (a sketch of the explicit simd variant is shown after the code below).

#pragma omp parallel reduction(+:sum1,sum2)   // one parallel region for the whole benchmark
for (int num = 0; num < 10000; num++) {
    #pragma omp for schedule(static) nowait   // work-share the i loop, no barrier at its end
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
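
For reference, the explicit simd variant mentioned above is a sketch of the same region with an additional hint on the j loop; as noted, it brought no further improvement here:

#pragma omp parallel reduction(+:sum1,sum2)
for (int num = 0; num < 10000; num++) {
    #pragma omp for schedule(static) nowait
    for (int i = 0; i < N; i++) {
        // extra hint; the compiler already vectorizes this loop on its own
        #pragma omp simd reduction(+:sum1,sum2)
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}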

Note: L1 miss ratio computed using perf:

$  perf stat -e cache-references,cache-misses -r 3 ./test
0 votes

Since variables in parallel programs are shared among threads (cores), you have to consider how the processor's cache memory comes into play. At this point your code might be running into false sharing, which could hurt the processor's performance.
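
As a stand-alone illustration of false sharing (my own sketch, not taken from the question's code): two threads repeatedly update adjacent elements of the same small array, so the distinct elements share one cache line and that line keeps bouncing between the cores.

#include <cstdio>
#include <omp.h>

int main()
{
    // Adjacent doubles almost certainly share a cache line; volatile keeps the
    // compiler from collapsing the loop into a single addition.
    volatile double partial[2] = {0.0, 0.0};

    #pragma omp parallel num_threads(2)
    {
        int t = omp_get_thread_num();
        for (long k = 0; k < 100000000L; k++)
            partial[t] = partial[t] + 1.0;   // each write invalidates the other core's copy
    }

    // Padding each element to its own cache line (e.g. a struct with alignas(64))
    // or accumulating into a private variable per thread removes the ping-pong.
    printf("%f %f\n", partial[0], partial[1]);
    return 0;
}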

In your first parallel version, you place the pragma right at the outermost for, which means each thread gets its own i and j. Compare that with the second and third versions (which differ only by collapse), where the middle for is parallelized, so each i has its own j. These two versions do better because each thread/core hits the cache line along j more often. The fourth version is a complete disaster for the processor caches, because there is almost nothing left to share there.

I recommend measuring your code with Intel's PCM or PAPI in order to get a proper analysis.

Regards.