
I have edited my question following the earlier comments (especially from @Zboson) for better readability.

I have always acted on, and observed, the conventional wisdom that the number of OpenMP threads should roughly match the number of hyper-threads on a machine for optimal performance. However, I am observing odd behaviour on my new laptop with an Intel Core i7 4960HQ, which has 4 cores and 8 hyper-threads. (See the Intel docs here.)

Here is my test code:

#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main() {
    const int n = 256*8192*100;
    double *A, *B;
    posix_memalign((void**)&A, 64, n*sizeof(double));
    posix_memalign((void**)&B, 64, n*sizeof(double));
    for (int i = 0; i < n; ++i) {
        A[i] = 0.1;
        B[i] = 0.0;
    }
    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        B[i] = exp(A[i]) + sin(B[i]);
    }
    double end = omp_get_wtime();
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += B[i];
    }
    printf("%g %g\n", end - start, sum);
    return 0;
}

When I compile it using gcc 4.9-20140209, with the command gcc -Ofast -march=native -std=c99 -fopenmp -Wa,-q, I see the following performance as I change OMP_NUM_THREADS [the points are an average of 5 runs; the error bars (which are hardly visible) are the standard deviations]:

[Plot: Performance as a function of thread count]

The plot is clearer when shown as the speed-up with respect to OMP_NUM_THREADS=1:

[Plot: Speed-up as a function of thread count]

The performance increases more or less monotonically with the thread count, even when the number of OpenMP threads far exceeds the core count and even the hyper-thread count! Usually performance should drop off when too many threads are used (at least in my previous experience), due to the threading overhead, especially as the calculation should be CPU (or at least memory) bound and not waiting on I/O.

Even more weirdly, the speed-up is 35 times!

Can anyone explain this?

I also tested this with a much smaller array of 8192*4 elements, and see similar performance scaling.

In case it matters, I am on Mac OS 10.9, and the performance data were obtained by running (under bash):

for i in {1..128}; do
    for k in {1..5}; do
        export OMP_NUM_THREADS=$i;
        echo -ne $i $k "";
        ./a.out;
    done;
done > out

EDIT: Out of curiosity I decided to try much larger numbers of threads. My OS limits this to 2000. The odd results (both the speed-up and the low thread overhead) speak for themselves!

[Plot: Crazy numbers of threads]

EDIT: I tried @Zboson's latest suggestion in their answer, i.e. putting VZEROUPPER before each math function within the loop, and it did fix the scaling problem! (It also brought the single-threaded code down from 22 s to 2 s!):

[Plot: correct scaling]

It may indeed be how OpenMP is assigning the threads. Have you tried 3 threads, just out of curiosity? It could be that when moving from 1 to 2 it is assigning both threads to a single ACTUAL core, and because you are truly trying to use the same resources within that single core, it really isn't helping! When moving to 4, you are truly utilizing 2 actual cores (maybe). Also, what happens if you use 8 threads, so we can see what happens when we move from (hopefully) a hyper-thread situation to a full-core + hyper-thread situation? – trumpetlicks
@trumpetlicks I added the timings you wanted. – jtravs
Also, if you do multiple runs of each (with the exception of the single case), what do the timings come out to? I think that OpenMP and the OS randomly assign to a core # (or in your case it could be assigning to a HT or an actual core). – trumpetlicks
Where are you changing the no. of threads used? – Devavrata
@Neuron by using the OMP_NUM_THREADS environment variable – jtravs

1 Answer


The problem is likely due to the clock() function. It does not return the wall time on Linux; it returns the CPU time accumulated by the whole process, i.e. summed over all threads. You should use the function omp_get_wtime() instead. It's more accurate than clock() and works on GCC, ICC, and MSVC. In fact, I use it for timing code even when I'm not using OpenMP.

I tested your code with it here http://coliru.stacked-crooked.com/a/26f4e8c9fdae5cc2
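To illustrate the difference, here is a minimal sketch (not from the original post; the loop body and array size are just placeholders) that times the same kind of parallel loop with both clock() and omp_get_wtime(). Because clock() measures CPU time summed over all threads, it grows with the thread count, while omp_get_wtime() reports the elapsed wall time you actually want.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>

int main() {
    const int n = 1 << 24;
    double *B = malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) B[i] = 0.1 * i;

    clock_t c0 = clock();          /* CPU time, summed over all threads */
    double  w0 = omp_get_wtime();  /* elapsed wall-clock time */

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        B[i] = sqrt(B[i]) + 1.0;
    }

    double cpu  = (double)(clock() - c0) / CLOCKS_PER_SEC;
    double wall = omp_get_wtime() - w0;

    /* With k busy threads, cpu comes out at roughly k * wall, which is why
       timings based on clock() appear not to scale with the thread count. */
    printf("clock(): %g s  omp_get_wtime(): %g s\n", cpu, wall);
    free(B);
    return 0;
}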

Edit: Another thing to consider which may be causing your problem is that the exp and sin functions you are using are compiled WITHOUT AVX support, while your code is compiled WITH AVX support (actually AVX2). You can see this on GCC explorer with your code if you compile with -fopenmp -mavx2 -mfma. Whenever you call a function without AVX support from code with AVX, you need to zero the upper part of the YMM registers or pay a large penalty. You can do this with the intrinsic _mm256_zeroupper (VZEROUPPER). Clang does this for you, but last I checked GCC does not, so you have to do it yourself (see the comments to this question Math functions takes more cycles after running any intel AVX function and also the answer here Using AVX CPU instructions: Poor performance without "/arch:AVX"). So every iteration you have a large delay due to not calling VZEROUPPER. I'm not sure why this matters with multiple threads, but if GCC does this each time it starts a new thread then it could help explain what you are seeing.

#include <immintrin.h>

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    _mm256_zeroupper();   // zero the upper YMM bits before calling the non-AVX sin()
    B[i] = sin(B[i]);
    _mm256_zeroupper();   // and again before the non-AVX exp()
    B[i] += exp(A[i]);
}

Edit: A simpler way to test this is, instead of compiling with -march=native, to either not set the arch at all (gcc -Ofast -std=c99 -fopenmp -Wa) or to just use SSE2 (gcc -Ofast -msse2 -std=c99 -fopenmp -Wa).

Edit: GCC 4.8 has an option -mvzeroupper which may be the most convenient solution.

This option instructs GCC to emit a vzeroupper instruction before a transfer of control flow out of the function to minimize the AVX to SSE transition penalty as well as remove unnecessary zeroupper intrinsics.
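For example (assuming the test program is saved as test.c; the file name is mine, not from the original post), adding that option to the question's compile line would look something like this:

gcc -Ofast -march=native -mvzeroupper -std=c99 -fopenmp -Wa,-q test.c

With -mvzeroupper the compiler emits the vzeroupper itself before control leaves AVX code, so the explicit _mm256_zeroupper() calls in the loop above should no longer be needed.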