Why does omp_set_dynamic(1) never adjust the number of threads (in Visual C++)?

Question

If we look at the Visual C++ documentation of omp_set_dynamic, it is literally copy-pasted from the OMP 2.0 standard (section 3.1.7 on page 39):

If [the function argument] evaluates to a nonzero value, the number of threads that are used for executing upcoming parallel regions may be adjusted automatically by the run-time environment to best use system resources. As a consequence, the number of threads specified by the user is the maximum thread count. The number of threads in the team executing a parallel region stays fixed for the duration of that parallel region and is reported by the omp_get_num_threads function.

It seems clear that omp_set_dynamic(1) allows the implementation to use fewer than the current maximum number of threads for a parallel region (presumably to prevent oversubscription under high loads). Any reasonable reading of this paragraph would suggest that said reduction should be observable by querying omp_get_num_threads inside parallel regions.

^{(Both documentations also show the signature as void omp_set_dynamic(int dynamic_threads);. It appears that "the number of threads specified by the user" does not refer to dynamic_threads but instead means "whatever the user specified using the remaining OpenMP interface").}

However, no matter how high I push my system load under omp_set_dynamic(1), the return value of omp_get_num_threads (queried inside the parallel regions) never changes from the maximum in my test program. Yet I can still observe clear performance differences between omp_set_dynamic(1) and omp_set_dynamic(0).

Here is a sample program to reproduce the issue:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <cstdlib>
#include <cmath>

#include <omp.h>

#define UNDER_LOAD true

const int SET_DYNAMIC_TO = 1;

const int REPEATS = 3000;
const unsigned MAXCOUNT = 1000000;

std::size_t threadNumSum = 0;
std::size_t threadNumCount = 0;

void oneRegion(int i)
{
  // Pesudo-randomize the number of iterations.
  unsigned ui = static_cast<unsigned>(i);
  int count = static_cast<int>(((MAXCOUNT + 37) * (ui + 7) * ui) % MAXCOUNT);

#pragma omp parallel for schedule(guided, 512)
  for (int j = 0; j < count; ++j)
  {
    if (j == 0)
    {
      threadNumSum += omp_get_num_threads();
      threadNumCount++;
    }

    if ((j + i + count) % 16 != 0)
      continue;

    // Do some floating point math.
    double a = j + i;
    for (int k = 0; k < 10; ++k)
      a = std::sin(i * (std::cos(a) * j + std::log(std::abs(a + count) + 1)));

    volatile double out = a;
  }
}


int main()
{
  omp_set_dynamic(SET_DYNAMIC_TO);


#if UNDER_LOAD
  for (int i = 0; i < 10; ++i)
  {
    std::thread([]()
    {
      unsigned x = 0;
      float y = static_cast<float>(std::sqrt(2));
      while (true)
      {
//#pragma omp parallel for
        for (int i = 0; i < 100000; ++i)
        {
          x = x * 7 + 13;
          y = 4 * y * (1 - y);
        }
        volatile unsigned xx = x;
        volatile float yy = y;
      }
    }).detach();
  }
#endif


  std::chrono::high_resolution_clock clk;
  auto start = clk.now();

  for (int i = 0; i < REPEATS; ++i)
    oneRegion(i);

  std::cout << (clk.now() - start).count() / 1000ull / 1000ull << " ms for " << REPEATS << " iterations" << std::endl;

  double averageThreadNum = double(threadNumSum) / threadNumCount;
  std::cout << "Entered " << threadNumCount << " parallel regions with " << averageThreadNum << " threads each on average." << std::endl;

  std::getchar();

  return 0;
}

Compiler version: Microsoft (R) C/C++ Optimizing Compiler Version 19.16.27024.1 for x64

On e.g. gcc, this program will print a significantly lower averageThreadNum for omp_set_dynamic(1) than for omp_set_dynamic(0). But on MSVC, the same value is shown in both cases, despite a 30% performance difference (170s vs 230s).

How can this be explained?

Max Langhof Max Langhof · Accepted Answer · 2019-05-09T09:07:49

In Visual C++, the number of threads executing the loop does get reduced with omp_set_dynamic(1) in this example, which explains the performance difference.

However, contrary to any good-faith interpretation of the standard (and Visual C++ docs), omp_get_num_threads does not report this reduction.

The only way to figure out how many threads MSVC actually uses for each parallel region is to inspect omp_get_thread_num on every loop iteration (or parallel task). The following would be one way to do it with little in-loop performance overhead:

// std::hardware_destructive_interference_size is not available in gcc or clang, also see comments by Peter Cordes:
// https://stackguides.com/questions/39680206/understanding-stdhardware-destructive-interference-size-and-stdhardware-cons
struct alignas(2 * std::hardware_destructive_interference_size) NoFalseSharing
{
    int flagValue = 0;
};

void foo()
{
  std::vector<NoFalseSharing> flags(omp_get_max_threads());

#pragma omp parallel for
  for (int j = 0; j < count; ++j)
  {
    flags[omp_get_thread_num()].flagValue = 1;

    // Your real loop body
  }

  int realOmpNumThreads = 0;
  for (auto flag : flags)
    realOmpNumThreads += flag.flagValue;
}

Indeed, you will find realOmpNumThreads to yield significantly different values from the omp_get_num_threads() inside the parallel region with omp_set_dynamic(1) on Visual C++.

One could argue that technically

"the number of threads in the team executing a parallel region" and
"the number of threads that are used for executing upcoming parallel regions"

are not literally the same.

This is a nonsensical interpretation of the standard in my view, because the intent is very clear and there is no reason for the standard to say "The number of threads in the team executing a parallel region stays fixed for the duration of that parallel region and is reported by the omp_get_num_threads function" in this section if this number is unrelated to the functionality of omp_set_dynamic.

However, it could be that MSVC decided to keep the number of threads in a team unaffected and just assign no loop iterations for execution to a subset of them under omp_set_dynamic(1) for ease of implementation.

Whatever the case may be: Do not trust omp_get_num_threads in Visual C++.

Why does omp_set_dynamic(1) never adjust the number of threads (in Visual C++)?

1 Answers