parallelizing in openMP

Question

I have the following code that I want to paralleize using OpenMP

for(m=0; m<r_c; m++)
{
    for(n=0; n<c_c; n++)
    {
        double value = 0.0;
        for(j=0; j<r_b; j++)
            for(k=0; k<c_b; k++)
            {
                double a;
                if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                    a = 0.0;
                else
                    a = h_a[((m-j)*c_a) + (n-k)];
                //printf("%lf\t", a);
                value += h_b[(j*c_b) + k] * a;
            }
        h_c[m*c_c + n] = value;
        //printf("%lf\t", h_c[m*c_c + n]);
    }
    //cout<<"row "<<m<<" completed"<<endl;
}

In this I want every thread to perform "for j" and "for k" simultaneouly. I am trying to do using pragma omp parallel for before the "for m" loop but not getting the correct result. How can I do this in an optimized manner. thanks in advance.

@AviGinsburg its the code for matrix 2D full convolution and i am trying that each thread perform the 'j' and 'k' loops in parallel. on doing so I am getting the correct result but the execution time is getting more than doubled. I am not able to figure out the problem — Manish Khilnani

Avi Ginsburg Avi Ginsburg · Accepted Answer · 2015-06-04T05:47:58

Depending exactly from which loop you want to parallelize, you have three options:

#pragma omp parallel
{
#pragma omp for    // Option #1
    for(m=0; m<r_c; m++)
    {
        for(n=0; n<c_c; n++)
        {
            double value = 0.0;
#pragma omp for    // Option #2
            for(j=0; j<r_b; j++)
                for(k=0; k<c_b; k++)
                {
                    double a;
                    if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                        a = 0.0;
                    else
                        a = h_a[((m-j)*c_a) + (n-k)];
                    //printf("%lf\t", a);
                    value += h_b[(j*c_b) + k] * a;
                }
                h_c[m*c_c + n] = value;
                //printf("%lf\t", h_c[m*c_c + n]);
        }
        //cout<<"row "<<m<<" completed"<<endl;
    }
}
//////////////////////////////////////////////////////////////////////////
// Option #3
for(m=0; m<r_c; m++)
{
    for(n=0; n<c_c; n++)
    {
#pragma omp parallel
            {
            double value = 0.0;
#pragma omp for
            for(j=0; j<r_b; j++)
                for(k=0; k<c_b; k++)
                {
                    double a;
                    if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                        a = 0.0;
                    else
                        a = h_a[((m-j)*c_a) + (n-k)];
                    //printf("%lf\t", a);
                    value += h_b[(j*c_b) + k] * a;
                }
            h_c[m*c_c + n] = value;
            //printf("%lf\t", h_c[m*c_c + n]);
        }
    }
    //cout<<"row "<<m<<" completed"<<endl;
}

Test and profile each. You might find that option #1 is fastest if there isn't a lot of work for each thread, or you may find that with optimizations on, there is no difference (or even a slowdown) when enabling OMP.

Edit

I've adopted the MCVE supplied in the comments as follows:

#include <iostream>
#include <chrono>
#include <omp.h>
#include <algorithm>
#include <vector>

#define W_OMP
int main(int argc, char *argv[])
{
    std::vector<double> h_a(9);
    std::generate(h_a.begin(), h_a.end(), std::rand);
    int r_b = 500;
    int c_b = r_b;
    std::vector<double> h_b(r_b * c_b);
    std::generate(h_b.begin(), h_b.end(), std::rand);
    int r_c = 500;
    int c_c = r_c;
    int r_a = 3, c_a = 3;
    std::vector<double> h_c(r_c * c_c);

    auto start = std::chrono::system_clock::now();

#ifdef W_OMP
#pragma omp parallel 
    {
#endif
        int m,n,j,k;
#ifdef W_OMP
#pragma omp for 
#endif
        for(m=0; m<r_c; m++)
        {
            for(n=0; n<c_c; n++)
            {
                double value = 0.0,a;
                for(j=0; j<r_b; j++)
                {
                    for(k=0; k<c_b; k++)
                    {
                        if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                            a = 0.0;
                        else a = h_a[((m-j)*c_a) + (n-k)];
                        value += h_b[(j*c_b) + k] * a;
                    }
                }
                h_c[m*c_c + n] = value;
            }
        }
#ifdef W_OMP
    }
#endif
    auto end = std::chrono::system_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << elapsed.count() << "ms"
#ifdef W_OMP
        "\t with OMP"
#else
        "\t without OMP"
#endif
        "\n";

    return 0;

}

As a reference, I'm using VS2012 (OMP 2.0, grrr). I'm not sure when collapse was introduced, but apparently after 2.0. Optimizations were /O2 and compiled in Release x64.

Benchmarks

Using the original sizes of the loops (7,7,5,5) and therefore arrays, the results were 0ms without OMP and 1ms with. Verdict: optimizations were better, and the added overhead wasn't worth it. Also, the measurements are not reliable (too short).

Using the slightly larger sizes of the loops (100, 100, 100, 100) and therefore arrays, the results were about equal at about 108ms. Verdict: still not worth the naive effort, tweaking OMP parameters might tip the scale. Definitely not the x4 speedup I would hope for.

Using an even larger sizes of the loops (500, 500, 500, 500) and therefore arrays, OMP started to pull ahead. Without OMP 74.3ms, with 15s. Verdict: Worth it. Weird. I got a x5 speedup with four threads and four cores on an i5. I'm not going to try and figure out how that happened.

Summary

As has been stated in countless answers here on SO, it's not always a good idea to parallelize every for loop you come across. Things that can screw up your desired xN speedup:

Not enough work per thread to justify the overhead of creating the additional threads
The work itself is memory bound. This means that the CPU can be running at 1petaHz and you still won't see a speedup.
Memory access patterns. I'm not going to go there. Feel free to edit in the relevant info if you want it.
OMP parameters. The best choice of parameters will often be a result of this entire list (not including this item, to avoid recursion issues).
SIMD operations. Depending on what and how you're doing, the compiler may vectorize your operations. I have no idea if OMP will usurp the SIMD operations, but it is possible. Check your assembly (foreign language to me) to confirm.

parallelizing in openMP

1 Answers

Edit

Benchmarks

Summary