OpenMP parallelizing matrix multiplication by a triple for loop (performance issue)

Question

I'm writing a program for matrix multiplication with OpenMP, that, for cache convenience, implements the multiplication A x B(transpose) rows X rows instead of the classic A x B rows x columns, for better cache efficiency. Doing this I faced an interesting fact that for me is illogic: if in this code i parallelize the extern loop the program is slower than if I put the OpenMP directives in the most inner loop, in my computer the times are 10.9 vs 8.1 seconds.

//A and B are double* allocated with malloc, Nu is the lenght of the matrixes 
//which are square

//#pragma omp parallel for
for (i=0; i<Nu; i++){
  for (j=0; j<Nu; j++){
    *(C+(i*Nu+j)) = 0.;
#pragma omp parallel for
    for(k=0;k<Nu ;k++){
      *(C+(i*Nu+j))+=*(A+(i*Nu+k)) * *(B+(j*Nu+k));//C(i,j)=sum(over k) A(i,k)*B(k,j)
    }
  }
}

By tweaking omp parameters I've got 200% speed up on my machine. original: llcomp.googlecode.com/hg/examples/mxm.c current: codepad.org/nSfZHp03 — jfs
Code that uses 'fortran' memory layout for the B matrix runs 4-8 faster (the greatest benefit) for 1000x1000 matrices (threaded version takes 0.5 seconds). gist.github.com/790865 — jfs
Have you estimated your Gflops/s? It should be 2.0*n^3/time. Compare that with the max for your CPU: frequency * (SIMD_width)* (2 ILP) * (number of cores). e.g on my 2600k it is (4GHz) * 4(AVX) * 2 (ILP) * 4 cores = 128 DP Gflops/s. Likely your efficiency is less than 10%. — user2088790

Ben Voigt Ben Voigt · Accepted Answer · 2011-01-18T18:57:20

Try hitting the result less often. This induces cacheline sharing and prevents the operation from running in parallel. Using a local variable instead will allow most of the writes to take place in each core's L1 cache.

Also, use of restrict may help. Otherwise the compiler can't guarantee that writes to C aren't changing A and B.

Try:

for (i=0; i<Nu; i++){
  const double* const Arow = A + i*Nu;
  double* const Crow = C + i*Nu;
#pragma omp parallel for
  for (j=0; j<Nu; j++){
    const double* const Bcol = B + j*Nu;
    double sum = 0.0;
    for(k=0;k<Nu ;k++){
      sum += Arow[k] * Bcol[k]; //C(i,j)=sum(over k) A(i,k)*B(k,j)
    }
    Crow[j] = sum;
  }
}

Also, I think Elalfer is right about needing reduction if you parallelize the innermost loop.

OpenMP parallelizing matrix multiplication by a triple for loop (performance issue)

2 Answers