Depending exactly from which loop you want to parallelize, you have three options:
#pragma omp parallel
{
#pragma omp for // Option #1
for(m=0; m<r_c; m++)
{
for(n=0; n<c_c; n++)
{
double value = 0.0;
#pragma omp for // Option #2
for(j=0; j<r_b; j++)
for(k=0; k<c_b; k++)
{
double a;
if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
a = 0.0;
else
a = h_a[((m-j)*c_a) + (n-k)];
value += h_b[(j*c_b) + k] * a;
}
h_c[m*c_c + n] = value;
}
}
}
for(m=0; m<r_c; m++)
{
for(n=0; n<c_c; n++)
{
#pragma omp parallel
{
double value = 0.0;
#pragma omp for
for(j=0; j<r_b; j++)
for(k=0; k<c_b; k++)
{
double a;
if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
a = 0.0;
else
a = h_a[((m-j)*c_a) + (n-k)];
value += h_b[(j*c_b) + k] * a;
}
h_c[m*c_c + n] = value;
}
}
}
Test and profile each. You might find that option #1 is fastest if there isn't a lot of work for each thread, or you may find that with optimizations on, there is no difference (or even a slowdown) when enabling OMP.
Edit
I've adopted the MCVE supplied in the comments as follows:
#include <iostream>
#include <chrono>
#include <omp.h>
#include <algorithm>
#include <vector>
#define W_OMP
int main(int argc, char *argv[])
{
std::vector<double> h_a(9);
std::generate(h_a.begin(), h_a.end(), std::rand);
int r_b = 500;
int c_b = r_b;
std::vector<double> h_b(r_b * c_b);
std::generate(h_b.begin(), h_b.end(), std::rand);
int r_c = 500;
int c_c = r_c;
int r_a = 3, c_a = 3;
std::vector<double> h_c(r_c * c_c);
auto start = std::chrono::system_clock::now();
#ifdef W_OMP
#pragma omp parallel
{
#endif
int m,n,j,k;
#ifdef W_OMP
#pragma omp for
#endif
for(m=0; m<r_c; m++)
{
for(n=0; n<c_c; n++)
{
double value = 0.0,a;
for(j=0; j<r_b; j++)
{
for(k=0; k<c_b; k++)
{
if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
a = 0.0;
else a = h_a[((m-j)*c_a) + (n-k)];
value += h_b[(j*c_b) + k] * a;
}
}
h_c[m*c_c + n] = value;
}
}
#ifdef W_OMP
}
#endif
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << elapsed.count() << "ms"
#ifdef W_OMP
"\t with OMP"
#else
"\t without OMP"
#endif
"\n";
return 0;
}
As a reference, I'm using VS2012 (OMP 2.0, grrr). I'm not sure when collapse
was introduced, but apparently after 2.0. Optimizations were /O2 and compiled in Release x64.
Benchmarks
Using the original sizes of the loops (7,7,5,5) and therefore arrays, the results were 0ms
without OMP and 1ms
with. Verdict: optimizations were better, and the added overhead wasn't worth it. Also, the measurements are not reliable (too short).
Using the slightly larger sizes of the loops (100, 100, 100, 100) and therefore arrays, the results were about equal at about 108ms
. Verdict: still not worth the naive effort, tweaking OMP parameters might tip the scale. Definitely not the x4 speedup I would hope for.
Using an even larger sizes of the loops (500, 500, 500, 500) and therefore arrays, OMP started to pull ahead. Without OMP 74.3ms
, with 15s
. Verdict: Worth it. Weird. I got a x5 speedup with four threads and four cores on an i5. I'm not going to try and figure out how that happened.
Summary
As has been stated in countless answers here on SO, it's not always a good idea to parallelize every for
loop you come across. Things that can screw up your desired xN speedup:
- Not enough work per thread to justify the overhead of creating the additional threads
- The work itself is memory bound. This means that the CPU can be running at 1petaHz and you still won't see a speedup.
- Memory access patterns. I'm not going to go there. Feel free to edit in the relevant info if you want it.
- OMP parameters. The best choice of parameters will often be a result of this entire list (not including this item, to avoid recursion issues).
- SIMD operations. Depending on what and how you're doing, the compiler may vectorize your operations. I have no idea if OMP will usurp the SIMD operations, but it is possible. Check your assembly (foreign language to me) to confirm.