I'm trying to speed up my PNG encoder using OpenMP (specifically the omp ordered pragma), which works consistently well with GCC (MinGW). With MSVC the resulting speedup varies wildly. At the problem size I'm working on, the parallelized for loop has roughly six (6) iterations. On a 4C/8T machine I would therefore expect roughly two "waves" of threads, so the whole loop should take about twice as long as a single iteration does on a single core. That is roughly what I see with GCC, but not with MSVC (there it often takes roughly 3-4x as long as one iteration).
I was able to distill a small example that shows the same behavior. Playing around with the parameters (number of iterations, computation time per iteration, etc.) I found that MSVC mostly performs inconsistently. Adding more iterations improves consistency somewhat, which makes sense, but that is not an option for the original problem.
#include <windows.h>
#include <stdint.h>
#include <stdio.h>
#include <omp.h>   // for omp_get_max_threads()

// Timer helpers based on QueryPerformanceCounter...
int64_t get_ts() {
    LARGE_INTEGER li;
    QueryPerformanceCounter(&li);
    return li.QuadPart;
}

double get_time_ms(int64_t prev) {
    LARGE_INTEGER li;
    QueryPerformanceFrequency(&li);
    double frequency = (double)li.QuadPart;
    QueryPerformanceCounter(&li);
    return (double)(li.QuadPart - prev) / (frequency * 1E-3);
}

#define TIMING_START int64_t _start = get_ts()
#define TIMING_END   printf("In %s: %.02fms\n", __FUNCTION__, get_time_ms(_start))

// Simulated workload; takes roughly 1.7ms on an old i7
void mySlowFunc() {
    //TIMING_START;
    volatile int a = 0;
    for (int j = 0; j < 1000001; j++) {
        a += j;
    }
    //TIMING_END;
}

#define NUM_ITER 6

int main(int argc, char* argv[]) {
    // Baseline for comparison: all iterations on a single core
    printf("===== Call on single core:\n");
    for (int i = 0; i < 5; i++) {
        TIMING_START;
        for (int j = 0; j < NUM_ITER; j++) {
            mySlowFunc();
        }
        TIMING_END;
    }

    printf("===== Call on multiple cores: %d\n", omp_get_max_threads());
    for (int i = 0; i < 5; i++) {
        TIMING_START;
        #pragma omp parallel
        {
            int y;
            #pragma omp for ordered
            for (y = 0; y < NUM_ITER; y++) {
                mySlowFunc();
                #pragma omp ordered
                {
                    // (Nearly) empty ordered block; the real encoder does its ordered output here
                    volatile int i = 0;
                }
            }
        }
        TIMING_END;
    }
    return 0;
}
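For completeness: the loop above does not specify a schedule clause, so each compiler falls back to its default split of the iterations. If the problem is how iterations get handed out to threads, forcing the schedule explicitly might make a difference. A variant I could try would look like this (untested sketch; the chunk size of 1 is an arbitrary choice):

#pragma omp parallel
{
    int y;
    // Hand out one iteration at a time instead of relying on the default split
    #pragma omp for ordered schedule(dynamic, 1)
    for (y = 0; y < NUM_ITER; y++) {
        mySlowFunc();
        #pragma omp ordered
        {
            volatile int i = 0;
        }
    }
}

All timings below were taken with the unmodified loop, without this clause.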
The following example outputs were produced on the same machine and the same OS (Windows 10, Intel i7 860).
Example output MinGW (-O3 -fopenmp -march=native -mtune=native):
===== Call on single core:
In main: 10.57ms // 6 iterations @1.7ms each; this performs as expected
In main: 10.42ms
In main: 10.57ms
In main: 10.59ms
In main: 10.36ms
===== Call on multiple cores: 8
In main: 4.44ms
In main: 3.53ms
In main: 3.06ms
In main: 3.16ms // roughly 3x speedup with 4C/8T. Seems reasonable
In main: 3.10ms
Example output MSVC (/MD /O2 /Ob2 /openmp):
===== Call on single core:
In misc_ordered: 10.49ms
In misc_ordered: 10.43ms
In misc_ordered: 10.45ms
In misc_ordered: 11.29ms
In misc_ordered: 10.36ms
===== Call on multiple cores: 8
In misc_ordered: 3.29ms // expected
In misc_ordered: 4.02ms
In misc_ordered: 6.26ms // why??? >:-(
In misc_ordered: 6.27ms
In misc_ordered: 6.21ms
Again: note that repeated runs with MSVC randomly give results anywhere between 3 and 6ms.
To me this looks like the OpenMP implementation in MSVC struggles to distribute the workload evenly across threads. But shouldn't the Windows scheduler take care of this anyway? And if so, why would it behave differently for two different executables? Alternatively, maybe some threads are waiting on each other (there is still ordering happening), but how could I verify and fix that?
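One idea for verifying the "threads are waiting" theory: record for every iteration which thread ran it and when it started and finished, then print that after the parallel region. A minimal sketch of what I have in mind (it drops into the timed loop of the example above and only uses NUM_ITER and mySlowFunc from there; the printing happens outside the measured region so it should not distort the timing):

double t0 = omp_get_wtime();
double beg[NUM_ITER], fin[NUM_ITER];
int    tid[NUM_ITER];
#pragma omp parallel
{
    int y;
    #pragma omp for ordered
    for (y = 0; y < NUM_ITER; y++) {
        tid[y] = omp_get_thread_num();
        beg[y] = omp_get_wtime();
        mySlowFunc();
        #pragma omp ordered
        {
            volatile int i = 0;
        }
        // Taken after the ordered block, so any wait for ordering is included
        fin[y] = omp_get_wtime();
    }
}
for (int y = 0; y < NUM_ITER; y++) {
    printf("iter %d: thread %d, start %+7.2fms, duration %6.2fms\n",
           y, tid[y], (beg[y] - t0) * 1e3, (fin[y] - beg[y]) * 1e3);
}

If the slow MSVC runs show iterations starting in a third "wave", or long gaps spent waiting in the ordered block, that would at least tell me where the time goes.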