I'm trying to speed up my PNG encoder using OpenMP (specifically the omp ordered pragma), which works consistently well with GCC (MinGW). With MSVC the resulting speedup varies wildly. At the problem size I'm working on, the parallelized for loop has roughly six (6) iterations. On a 4C/8T machine I would therefore expect roughly two "waves" of threads, so the whole loop should take about twice as long as a single iteration does on a single core. That is roughly what I see with GCC, but not with MSVC (there it often takes roughly 3-4x as long as one iteration).
I was able to distill a small example that shows the same behavior. Playing around with the parameters (number of iterations, computation time per iteration, etc.) I found that MSVC mostly performs inconsistently. Adding more iterations improves consistency somewhat, which makes sense, but that is not an option for the original problem.
#include <windows.h>
#include <stdint.h>
#include <stdio.h>
#include <omp.h>   // for omp_get_max_threads()

// Timer helpers based on QueryPerformanceCounter...
int64_t get_ts() {
    LARGE_INTEGER li;
    QueryPerformanceCounter(&li);
    return li.QuadPart;
}

double get_time_ms(int64_t prev) {
    LARGE_INTEGER li;
    QueryPerformanceFrequency(&li);
    double frequency = (double)li.QuadPart;
    QueryPerformanceCounter(&li);
    return (double)(li.QuadPart - prev) / (frequency * 1E-3);
}

#define TIMING_START int64_t _start = get_ts()
#define TIMING_END   printf("In %s: %.02fms\n", __FUNCTION__, get_time_ms(_start))

// Simulated workload; takes roughly 1.7ms on an old i7
void mySlowFunc() {
    //TIMING_START;
    volatile int a = 0;
    for (int j = 0; j < 1000001; j++) {
        a += j;
    }
    //TIMING_END;
}

#define NUM_ITER 6

int main(int argc, char* argv[]) {
    // Baseline for comparison: all iterations on a single core
    printf("===== Call on single core:\n");
    for (int i = 0; i < 5; i++) {
        TIMING_START;
        for (int j = 0; j < NUM_ITER; j++) {
            mySlowFunc();
        }
        TIMING_END;
    }

    printf("===== Call on multiple cores: %d\n", omp_get_max_threads());
    for (int i = 0; i < 5; i++) {
        TIMING_START;
        #pragma omp parallel
        {
            int y;
            #pragma omp for ordered
            for (y = 0; y < NUM_ITER; y++) {
                mySlowFunc();
                #pragma omp ordered
                {
                    // (Nearly) empty ordered block; the real encoder does its ordered output here
                    volatile int i = 0;
                }
            }
        }
        TIMING_END;
    }
    return 0;
}
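For completeness: the loop above does not specify a schedule clause, so each compiler falls back to its default split of the iterations. If the problem is how iterations get handed out to threads, forcing the schedule explicitly might make a difference. A variant I could try would look like this (untested sketch; the chunk size of 1 is an arbitrary choice):

#pragma omp parallel
{
    int y;
    // Hand out one iteration at a time instead of relying on the default split
    #pragma omp for ordered schedule(dynamic, 1)
    for (y = 0; y < NUM_ITER; y++) {
        mySlowFunc();
        #pragma omp ordered
        {
            volatile int i = 0;
        }
    }
}

All timings below were taken with the unmodified loop, without this clause.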
The following example outputs were produced on the same machine and the same OS (Windows 10, Intel i7 860).
Example output MinGW (-O3 -fopenmp -march=native -mtune=native):
===== Call on single core:
In main: 10.57ms // 6 iterations @1.7ms each; this performs as expected
In main: 10.42ms
In main: 10.57ms
In main: 10.59ms
In main: 10.36ms
===== Call on multiple cores: 8
In main: 4.44ms
In main: 3.53ms
In main: 3.06ms
In main: 3.16ms // roughly 3x speedup with 4C/8T. Seems reasonable
In main: 3.10ms
Example output MSVC (/MD /O2 /Ob2 /openmp):
===== Call on single core:
In misc_ordered: 10.49ms
In misc_ordered: 10.43ms
In misc_ordered: 10.45ms
In misc_ordered: 11.29ms
In misc_ordered: 10.36ms
===== Call on multiple cores: 8
In misc_ordered: 3.29ms // expected
In misc_ordered: 4.02ms
In misc_ordered: 6.26ms // why??? >:-(
In misc_ordered: 6.27ms
In misc_ordered: 6.21ms
Again: note that repeated runs with MSVC randomly give results anywhere between 3 and 6ms.
To me this looks like the OpenMP implementation in MSVC struggles to distribute the workload evenly across threads. But shouldn't the Windows scheduler take care of this anyway? And if so, why would it behave differently for two different executables? Alternatively, maybe some threads are waiting on each other (there is still ordering happening), but how could I verify and fix that?
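One idea for verifying the "threads are waiting" theory: record for every iteration which thread ran it and when it started and finished, then print that after the parallel region. A minimal sketch of what I have in mind (it drops into the timed loop of the example above and only uses NUM_ITER and mySlowFunc from there; the printing happens outside the measured region so it should not distort the timing):

double t0 = omp_get_wtime();
double beg[NUM_ITER], fin[NUM_ITER];
int    tid[NUM_ITER];
#pragma omp parallel
{
    int y;
    #pragma omp for ordered
    for (y = 0; y < NUM_ITER; y++) {
        tid[y] = omp_get_thread_num();
        beg[y] = omp_get_wtime();
        mySlowFunc();
        #pragma omp ordered
        {
            volatile int i = 0;
        }
        // Taken after the ordered block, so any wait for ordering is included
        fin[y] = omp_get_wtime();
    }
}
for (int y = 0; y < NUM_ITER; y++) {
    printf("iter %d: thread %d, start %+7.2fms, duration %6.2fms\n",
           y, tid[y], (beg[y] - t0) * 1e3, (fin[y] - beg[y]) * 1e3);
}

If the slow MSVC runs show iterations starting in a third "wave", or long gaps spent waiting in the ordered block, that would at least tell me where the time goes.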