If we look at the Visual C++ documentation of omp_set_dynamic, it is literally copy-pasted from the OMP 2.0 standard (section 3.1.7 on page 39):
If [the function argument] evaluates to a nonzero value, the number of threads that are used for executing upcoming parallel regions may be adjusted automatically by the run-time environment to best use system resources. As a consequence, the number of threads specified by the user is the maximum thread count. The number of threads in the team executing a parallel region stays fixed for the duration of that parallel region and is reported by the omp_get_num_threads function.
It seems clear that omp_set_dynamic(1) allows the implementation to use fewer than the current maximum number of threads for a parallel region (presumably to prevent oversubscription under high loads). Any reasonable reading of this paragraph would suggest that said reduction should be observable by querying omp_get_num_threads inside parallel regions.
(Both documents also show the signature as void omp_set_dynamic(int dynamic_threads); it appears that "the number of threads specified by the user" does not refer to dynamic_threads but instead means "whatever the user specified using the remaining OpenMP interface".)
However, no matter how high I push my system load under omp_set_dynamic(1), the return value of omp_get_num_threads (queried inside the parallel regions) never changes from the maximum in my test program. Yet I can still observe clear performance differences between omp_set_dynamic(1) and omp_set_dynamic(0).
Here is a sample program to reproduce the issue:
#include <atomic>
#include <chrono>
#include <cstdio>
#include <iostream>
#include <thread>
#include <cstdlib>
#include <cmath>
#include <omp.h>

#define UNDER_LOAD true

const int SET_DYNAMIC_TO = 1;
const int REPEATS = 3000;
const unsigned MAXCOUNT = 1000000;

std::size_t threadNumSum = 0;
std::size_t threadNumCount = 0;

void oneRegion(int i)
{
    // Pseudo-randomize the number of iterations.
    unsigned ui = static_cast<unsigned>(i);
    int count = static_cast<int>(((MAXCOUNT + 37) * (ui + 7) * ui) % MAXCOUNT);

    #pragma omp parallel for schedule(guided, 512)
    for (int j = 0; j < count; ++j)
    {
        // Only the single iteration j == 0 updates the counters,
        // so each region records its team size exactly once.
        if (j == 0)
        {
            threadNumSum += omp_get_num_threads();
            threadNumCount++;
        }

        if ((j + i + count) % 16 != 0)
            continue;

        // Do some floating point math.
        double a = j + i;
        for (int k = 0; k < 10; ++k)
            a = std::sin(i * (std::cos(a) * j + std::log(std::abs(a + count) + 1)));

        volatile double out = a;
    }
}

int main()
{
    omp_set_dynamic(SET_DYNAMIC_TO);

#if UNDER_LOAD
    // Spawn busy background threads to put the system under load.
    for (int i = 0; i < 10; ++i)
    {
        std::thread([]()
        {
            unsigned x = 0;
            float y = static_cast<float>(std::sqrt(2));
            while (true)
            {
                //#pragma omp parallel for
                for (int i = 0; i < 100000; ++i)
                {
                    x = x * 7 + 13;
                    y = 4 * y * (1 - y);
                }
                volatile unsigned xx = x;
                volatile float yy = y;
            }
        }).detach();
    }
#endif

    std::chrono::high_resolution_clock clk;
    auto start = clk.now();
    for (int i = 0; i < REPEATS; ++i)
        oneRegion(i);
    std::cout << (clk.now() - start).count() / 1000ull / 1000ull << " ms for " << REPEATS << " iterations" << std::endl;

    double averageThreadNum = double(threadNumSum) / threadNumCount;
    std::cout << "Entered " << threadNumCount << " parallel regions with " << averageThreadNum << " threads each on average." << std::endl;

    std::getchar();
    return 0;
}
Compiler version: Microsoft (R) C/C++ Optimizing Compiler Version 19.16.27024.1 for x64
On gcc, for example, this program prints a significantly lower averageThreadNum for omp_set_dynamic(1) than for omp_set_dynamic(0). But on MSVC, the same value is shown in both cases, despite a 30% performance difference (170s vs 230s).
How can this be explained?