
I have a problem with OpenMP tasks. I am trying to create a parallel version of a "for" loop using OpenMP tasks. However, the execution time of this version is close to 2 times longer than the base version, where I use omp for, and I do not know the reason for this. Look at the code below:

omp for version:

t.start();
#pragma omp parallel num_threads(threadsNumber)
{
    for(int ts=0; ts<1000; ++ts)
    {
        // worksharing loop: iterations are split across the team
        // on every time step
        #pragma omp for
        for(int i=0; i<size; ++i)
        {
            array_31[i] = array_11[i] * array_21[i];
        }
    }
}
t.stop();
cout << "Time of omp for: " << t.time() << endl;

omp task version:

t.start();
#pragma omp parallel num_threads(threadsNumber)
{
    #pragma omp master
    {
        for(int ts=0; ts<1000; ++ts)
        {
            for(int th=0; th<threadsNumber; ++th)
            {
                // one task per thread-sized block; any thread in the
                // team may execute it
                #pragma omp task
                {
                    for(int i=th*blockSize; i<th*blockSize+blockSize; ++i)
                    {
                        array_32[i] = array_12[i] * array_22[i];
                    }
                }
            }

            // wait for this time step's tasks before starting the next one
            #pragma omp taskwait
        }
    }
}
t.stop();
cout << "Time of omp task: " << t.time() << endl;

In the task version I divide the loop in the same way as in omp for. Each task has to execute the same number of iterations, and the total number of tasks is equal to the total number of threads.
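For completeness, blockSize is not shown in the snippets; the division they assume is the following (a sketch of the setup, assuming size is divisible by threadsNumber):

// assumed setup, not shown above: equal-sized blocks per thread
const int blockSize = size / threadsNumber;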

Performance results:

Time of omp for: 4.54871
Time of omp task: 8.43251

What can be the problem? Is it possible to achieve similar performance for both versions? The attached code is simple, because I only wanted to illustrate the problem I am trying to resolve. I do not expect both versions to give me the same performance, but I would like to reduce the difference.

Thanks for any reply. Best regards.


2 Answers

0 votes

I think the issue here is the overhead. When you declare a loop as a worksharing for, the threads are all assigned their part of the loop at once. When you use tasks, the runtime must go through the whole setup process every time you launch a task. Why not just do the following:

#pragma omp parallel num_threads(threadsNumber)
{
    // note: no "omp master" wrapper here -- a worksharing "for" must be
    // encountered by all threads of the team, not just the master
    for(int ts=0; ts<1000; ++ts)
    {
        // the (typically static) default schedule hands the same th
        // block to the same thread on every time step
        #pragma omp for
        for(int th=0; th<threadsNumber; ++th)
        {
            for(int i=th*blockSize; i<th*blockSize+blockSize; ++i)
            {
                array_32[i] = array_12[i] * array_22[i];
            }
        }
    }
}
0 votes

I'd say that the issue you're experiencing here is related to data affinity: when you use #pragma omp for, the distribution of iterations across threads is the same for every value of ts, whereas with tasks you have no way to specify a binding of tasks to threads.
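If you want to keep tasks but avoid creating and synchronizing them by hand, one option is taskloop (OpenMP 4.5+, so it needs a newer compiler than the GCC 5.3 mentioned below). It still gives no control over which thread runs which chunk, but it cuts the bookkeeping. A minimal sketch, assuming the same arrays and blockSize as in the question:

#pragma omp parallel num_threads(threadsNumber)
{
    #pragma omp master
    {
        for(int ts=0; ts<1000; ++ts)
        {
            // creates tasks of roughly blockSize iterations each and
            // waits for them (implicit taskgroup), replacing the
            // manual task loop + taskwait
            #pragma omp taskloop grainsize(blockSize)
            for(int i=0; i<size; ++i)
            {
                array_32[i] = array_12[i] * array_22[i];
            }
        }
    }
}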

That said, I executed your program on my machine with three arrays of 1M elements, and the results of the two versions are closer:

  • t1_for: 2.041443s
  • t1_tasking: 2.159012s

(I used GCC 5.3.0 20151204)