1 vote

I am trying to learn OpenMP, but the code runs slower with OpenMP than without it. There have been various postings about this, but none seems to apply to my issue. I have created a simple program that illustrates the point for 'omp parallel for'. When running it I got the following performance:

   No OMP 0.0109663sec
   Parallel for: 0.0076869sec single thread
   Parallel for: 0.0151231sec 2 threads
   Parallel for: 0.0169528sec 4 threads
   Parallel for: 0.0150955sec 8 threads

Using 2 to 8 threads gives roughly half the performance of not using OpenMP. Clearly this is not what I expected.

I am using Visual Studio Express 2015, and it doesn't matter whether the optimizer is on or off. I have set /openmp on the C++ compiler command line, and I believe I have set the shared and private clauses correctly. I am initializing an array with 1,000,000 entries, so any initial overhead of setting up the parallel threads should not be the issue. I have an Intel i7 with 8 cores.

Code: I have two functions, testparallelfor() and testnoomp(); the function names should be self-explanatory. The statement ++th[omp_get_thread_num()]; just counts how many loop iterations each thread gets. The result is the same even if I comment that statement out. I have also tried a static array double a[1000*1000] to rule out the dynamic heap allocation of a as the issue.

#include <omp.h>
#include <iostream>
using namespace std;

static int th[8];

void reset_th()
{
    int i;
    for (i = 0; i < 8; ++i)
        th[i] = -1;
}

void out_th()
{
    int i;
    cout << "Threads ";
    for (i = 0; i < 8; ++i)
        cout << i << ":" << th[i] + 1 << ", ";
    cout << endl;
}

void testparallelfor(int len, int no)
{
    const int n = 1000 * 1000;
    double tw;
    double *a = new double[n];

    reset_th();
    tw = omp_get_wtime();
#pragma omp parallel shared(a, len, th) num_threads(no) if (len > 1000)
    {
#pragma omp for 
        for (int la = 0; la < len; ++la)
        {
            ++th[omp_get_thread_num()];
            a[la] = la * 2 + 1; 
        }
    }

    tw = omp_get_wtime() - tw;
    cout << "Parallel for " << tw << "sec" << endl;
    out_th();
    delete[] a;     // release the heap buffer
}

void testnoomp(int len)
{
    int n = 1000 * 1000;
    double tw;
    double *a = new double[n];

    reset_th();
    tw = omp_get_wtime();
    for (int la = 0; la < len; ++la)
    {
        ++th[omp_get_thread_num()];
        a[la] = la * 2 + 1;
    }

    tw = omp_get_wtime() - tw;
    cout << "No OMP " << tw << "sec" << endl;
    out_th();
    delete[] a;     // release the heap buffer
}

int main()
{
    int n = 1000*1000;

    testnoomp(n);                // no OpenMP
    for (int i = 1; i <= 8; i *= 2)
        testparallelfor(n, i);   // i is the number of threads to be used

    cout << endl;
    return 0;
}

Any help or insight would be appreciated.


2 Answers

2 votes

At first glance, some notes/observations:

  • Updating int th[8] from multiple threads causes false sharing, which can mean a massive slowdown on x86 architectures. Spread the values at least 64 bytes apart (an alternative that privatizes the counters is sketched right after this list).

  • The workload is too short in relation to parallelization overhead. Especially on the first run, OpenMP will lazily start additional threads in the pool.

  • The workload is too simple and mainly memory-bound. Memory bandwidth is limited and shared between all threads. Once the working set no longer fits in the cache, parallelizing leads to a further slowdown due to cache thrashing.

  • A good optimizing compiler may optimize away the loop and/or the calculations since they are never used (fortunately MSVC doesn't in this case).

  • Pay attention to the number of actual cores vs. Hyper-Threaded processors. Hyper-Threading does not double CPU capacity.
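On the false-sharing point, padding is not the only fix: each thread can keep its counter in a private variable and write it back to the shared array once at the end, so the hot loop never touches a shared cache line. A minimal sketch of that idea (not part of this answer's code; the 4-thread count is an arbitrary choice for illustration):

#include <omp.h>
#include <iostream>

static long long th[8];                       // shared results, written once per thread

int main() {
    const int len = 1000 * 1000;

#pragma omp parallel num_threads(4)           // hypothetical thread count
    {
        long long local = 0;                  // thread-private: no cache-line contention
#pragma omp for
        for (int la = 0; la < len; ++la)
            ++local;                          // hot loop touches only the private copy
        th[omp_get_thread_num()] = local;     // one shared write per thread at the end
    }

    for (int i = 0; i < 4; ++i)
        std::cout << "thread " << i << ": " << th[i] << '\n';
    return 0;
}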

If I compensate for false sharing, increase the workload complexity (throw a sqrt in there) and the size (iterate 50 times), then I do see a performance increase.

Test results (tested on MSVC 2019 x64 /O2 /fp:fast):

No OMP 0.413487sec
1 Parallel for 0.440291sec
2 Parallel for 0.217796sec
4 Parallel for 0.108129sec
8 Parallel for 0.0959285sec

At 8 threads the speedup becomes negligible. That's because my system (i7-7700) has 4 cores, and hyper-threading doesn't help with operations like sqrt due to a single FP execution unit per core.
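To see what the runtime actually reports, you can query the processor count; note that both calls below count logical processors (hyper-threads included), so an i7-7700 shows 8 even though it has only 4 physical cores. A small sketch (physical-core counts need OS-specific APIs):

#include <omp.h>
#include <iostream>
#include <thread>

int main() {
    // All three report *logical* processors, not physical cores.
    std::cout << "omp_get_num_procs():                 " << omp_get_num_procs() << '\n';
    std::cout << "omp_get_max_threads():               " << omp_get_max_threads() << '\n';
    std::cout << "std::thread::hardware_concurrency(): " << std::thread::hardware_concurrency() << '\n';
    return 0;
}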

My adjusted version:

#include <cmath>
#include <iostream>
#include <omp.h>
using namespace std;

struct tsint {
  alignas(64) int x; // spread elements by 64 bytes
};
static tsint th[8];

static constexpr int n = 1000 * 1000;

void reset_th() {
  int i;
  for (i = 0; i < 8; ++i)
    th[i].x = -1;
}

void out_th() {
  int i;
  cout << "Threads ";
  for (i = 0; i < 8; ++i)
    cout << i << ":" << th[i].x  + 1 << ", ";
  cout << endl;
}

void testparallelfor(int m, int no) {
  double tw;
  double* a = new double[n]();

  reset_th();
  tw = omp_get_wtime();
#pragma omp parallel num_threads(no) // no need to specify shared vars, shared is the default
  {
#pragma omp for
    for (int la = 0; la < n*m; ++la) {
      ++th[omp_get_thread_num()].x;
      a[la % n] = sqrt(la * 2.1) + 1;       // heavier math
    }
  }

  tw = omp_get_wtime() - tw;
  cout << no << " Parallel for " << tw << "sec" << endl;
  out_th();
  delete[] a;   // release the buffer
}

void testnoomp(int m) {
  double tw;
  double* a = new double[n]();

  reset_th();
  tw = omp_get_wtime();
  for (int la = 0; la < n*m; ++la) {
    ++th[omp_get_thread_num()].x;
    a[la % n] = sqrt(la * 2.1) + 1;         // heavier math
  }

  tw = omp_get_wtime() - tw;
  cout << "No OMP " << tw << "sec" << endl;
  out_th();
  delete[] a;   // release the buffer
}

int main(int argc, char** argv) {
  int m = argc + 49; // m is number of iterations (50, but compiler doesn't know)
  testnoomp(m); // no OpenMP

  for (int i = 1; i <= 8; i *= 2)
    testparallelfor(m, i); // i is the number of threads to be used

  // repeat the test again now that thread pool is hot
  for (int i = 1; i <= 8; i *= 2)
    testparallelfor(m, i); // i is the number of threads to be used

  cout << endl;
  return 0;
}

Final note:

  • try compiling with /arch:AVX2 for auto-vectorization; it should roughly double the speed (but then memory bandwidth will be even more of an issue for the parallel version). A hedged sketch of an explicit vectorization hint follows.
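As a related aside (not from the answer itself): OpenMP 4.0's simd clause can request vectorization explicitly. A minimal sketch; note that MSVC's default /openmp mode implements OpenMP 2.0 and rejects simd, so on MSVC this needs /openmp:experimental (GCC/Clang accept it with -fopenmp):

#include <cmath>
#include <omp.h>

static const int n = 1000 * 1000;
static double a[n];

int main() {
    // 'parallel for simd' asks for both threading and vectorization of the loop.
#pragma omp parallel for simd
    for (int la = 0; la < n; ++la)
        a[la] = std::sqrt(la * 2.1) + 1.0;
    return (int)a[1];   // keep the result observable so the loop isn't elided
}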
1 vote
  1. Working with the shared array th is a great way to slow down any multithreaded code. Data is shared in cache lines, which are 64 bytes long. If two cores work on the same cache line, they slow each other down even when they touch different parts of the line. This phenomenon is known as "false sharing".

  2. I don't know much about OMP. I generally don't have much trust in any built-in compilation that generates multi-threading. For instance, if most of your threads advance simultaneously, then they all modify data on the same cache line, which is not very productive. Is the compiler even aware of it?

  3. Your code doesn't really do any calculations; it primarily stores data. Most modern systems have a limited memory bus, and a single core can fully or almost fully saturate it. You might need a server-grade processor or similar to get a significantly better memory bus (a bandwidth sketch follows this list).
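To illustrate the memory-bus point, here is a rough, hedged sketch of a STREAM-style triad: almost no arithmetic per byte moved, so adding threads quickly hits the bandwidth ceiling instead of scaling with core count (the array size and thread counts are arbitrary choices):

#include <omp.h>
#include <iostream>
#include <vector>

int main() {
    const int n = 20 * 1000 * 1000;          // ~480 MB across three arrays, well past cache
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
    const double s = 3.0;

    for (int threads = 1; threads <= 8; threads *= 2) {
        double t = omp_get_wtime();
#pragma omp parallel for num_threads(threads)
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + s * c[i];          // one multiply-add per 24 bytes moved
        t = omp_get_wtime() - t;
        // ~3 streams of 8 bytes per element cross the memory bus (writes may add a 4th)
        std::cout << threads << " thread(s): " << t << " s, ~"
                  << (3.0 * n * sizeof(double) / t) / 1e9 << " GB/s\n";
    }
    return 0;
}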