
I'm trying to multiply two large matrices (1000x1000 up to 5000x5000, double precision). I have to use OpenMP to parallelise the multiplication. The parallel for loop is processed by p threads, and they appear to be scheduled correctly based on printing omp_get_thread_num(). I'm running on a 4-core CPU and have confirmed that the maximum number of threads is 4. The CPUs are virtual, if that makes any difference. The problem is that the run time doesn't decrease when I change the number of threads.

[screenshot: lscpu results]

  • I have checked that the libgomp library is installed with ldconfig -p | grep -i "gomp".

  • I have tried moving the parallel directive to each of the nested loops.

  • I have tried changing the scheduling and chunk size.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <time.h>
#include <unistd.h> /* for sleep() */

double** createMatrix(int N)
{
  double** rndMatrix;
  srand48((long int)time(NULL));
  rndMatrix = malloc(sizeof(double*)*N);
  int n,m;

  for(n=0; n<N; n++){
      rndMatrix[n] = malloc(sizeof(double)*N);
      for (m=0;m<N;m++){
          rndMatrix[n][m] = drand48();
      }
  }
  return rndMatrix;
}

void problem1(double** a, double** b, int N, int p){
    int i,k,j;
  int g;
  double** c;
  c = malloc(sizeof(double*)*N);

  for(g=0; g<N; ++g)
      c[g] = malloc(sizeof(double)*N);

  //Timer start
  clock_t tStart = clock();
  //time_t tStart, tEnd;
  //tStart =time(NULL);

  //Parallelised part
#pragma omp parallel shared(a,b,c,N) private(i,k,j) num_threads(p)
  {
#pragma omp for schedule(static) nowait
      for(i=0; i<N; ++i){
          for(j=0; j<N; ++j){
                  double sum = 0;
                  for(k=0; k<N; ++k){
                      sum += a[i][k] * b[k][j];
                  }
                  c[i][j]=sum;
          }
      }
  }

  //Timer end
  printf("Time taken: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);
  //tEnd = time(NULL);
  //printf("Time taken: %ds\n",  tEnd - tStart);
}


int main(void)
{
  int p=0;
  int N=0;
  //User input:

  printf("Enter matrix dimension:\n");
  scanf("%d", &N);

  printf("Please enter nb of threads:\n");
  scanf("%d", &p);

  double **a;
  double **b;

  a = createMatrix(N);
  sleep(2);
  b = createMatrix(N);

  problem1(a,b,N,p);

  return 0;
}
1
Zulan: 1) How do you measure time? 2) What is the exact processor you run it on? 3) How do you compile the code? 4) Please provide a minimal reproducible example.

Weller1995: 1) I've used both clock() and time() as shown in the code. 2) The results of lscpu can be seen in the added picture. 3) gcc -O3 -fopenmp openmpMain.c (O2 or O3 is required for the assignment). 4) I edited the above code to be runnable.

Zulan: Did you get the same results with clock and time?

Weller1995: The results are the same, yes; clock() just returned more decimals.

1 Answer


You multiply your matrices with a cache-unfriendly loop order (ijk).

for(i=0; i<N; ++i){
    for(j=0; j<N; ++j){
        double sum = 0;
        for(k=0; k<N; ++k){
            sum += a[i][k] * b[k][j];
        }
        c[i][j]=sum;
    }
}

Whenever k is incremented in the inner loop, b is traversed column-wise, which generates a cache miss. The result is roughly one cache miss per iteration. This largely dominates the computation time, and your algorithm is memory-bound.

Increasing the number of cores will not increase your memory bandwidth (apart from the slight increase in aggregate cache size, which may marginally improve computation time).

OpenMP is only useful for compute-bound problems, not for memory-bound computations.

To see the effect of extra cores, you must use another algorithm, for instance by changing the order of iterations to ikj.

for(i=0; i<N; ++i){
    for(k=0; k<N; ++k){
        double r = a[i][k];
        for(j=0; j<N; ++j){
            c[i][j] += r * b[k][j];  /* c must be zero-initialized */
        }
    }
}

When the inner index j is incremented, c[i][j] and b[k][j] are traversed row-wise. Instead of one miss per iteration, you get only two misses every eight iterations (a 64-byte cache line holds eight doubles), and memory bandwidth is no longer the limiting factor. Your computation time is largely reduced and it scales with the number of cores used.

Time taken (N=2000, p=1): 4.62s
Time taken (N=2000, p=2): 3.03s
Time taken (N=2000, p=4): 2.34s
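For completeness, a self-contained sketch of the ikj version with the OpenMP pragma on the outer loop (alloc_matrix is an illustrative helper; c must start zeroed because the inner loop accumulates):

```c
#include <stdlib.h>
#include <omp.h>

/* Allocate an NxN matrix; calloc zero-initializes every entry. */
static double **alloc_matrix(int N)
{
    double **m = malloc(sizeof(double *) * N);
    for (int i = 0; i < N; i++)
        m[i] = calloc(N, sizeof(double));
    return m;
}

/* ikj-order multiply: c += a * b, parallelized over rows of c. */
void matmul_ikj(double **a, double **b, double **c, int N, int p)
{
#pragma omp parallel for schedule(static) num_threads(p)
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double r = a[i][k];          /* reused for the whole row of b */
            for (int j = 0; j < N; j++)
                c[i][j] += r * b[k][j];  /* row-wise: cache friendly */
        }
}
```

Parallelizing over i means each thread writes a disjoint set of rows of c, so no synchronization is needed.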

ikj is not the only option. You can also use blocked matrix multiplication, where the multiplication is still done in ijk order, but on small sub-matrices that fit in the L1 cache.

#define BL 40
#define MIN(x, y) ((x) < (y) ? (x) : (y))

  for (int jj = 0; jj < N; jj += BL)
    for (int kk = 0; kk < N; kk += BL)
      for (int i = 0; i < N; i++)
        for (int j = jj; j < MIN(jj + BL, N); j++)
          {
            double sum = 0.0;
            for (int k = kk; k < MIN(kk + BL, N); k++)
              sum += a[i][k] * b[k][j];
            c[i][j] += sum;  /* accumulate across kk blocks; c must start zeroed */
          }

The code is slightly longer, but since it avoids cache misses it is also compute-bound and can be improved by parallelization.

Time taken (N=2000, p=1): 7.22s
Time taken (N=2000, p=2): 3.78s
Time taken (N=2000, p=4): 3.08s
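The blocked variant can be parallelized the same way: putting the pragma on the i loop keeps each thread writing disjoint rows of c. A sketch under the same assumptions (BL is just a tuning knob, alloc_matrix an illustrative helper):

```c
#include <stdlib.h>
#include <omp.h>

#define BL 40
#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* Allocate an NxN matrix; calloc zero-initializes every entry. */
static double **alloc_matrix(int N)
{
    double **m = malloc(sizeof(double *) * N);
    for (int i = 0; i < N; i++)
        m[i] = calloc(N, sizeof(double));
    return m;
}

/* Blocked multiply, c += a * b; parallel over i, so threads write
   disjoint rows of c. The parallel region is re-entered once per
   (jj, kk) block pair, which is acceptable for a sketch. */
void matmul_blocked(double **a, double **b, double **c, int N, int p)
{
    for (int jj = 0; jj < N; jj += BL)
        for (int kk = 0; kk < N; kk += BL)
#pragma omp parallel for schedule(static) num_threads(p)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < MIN(jj + BL, N); j++) {
                    double sum = 0.0;
                    for (int k = kk; k < MIN(kk + BL, N); k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] += sum;  /* accumulate across kk blocks */
                }
}
```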

But you will never gain anything by using OpenMP on a memory-bound problem.