
I have the following code, for which I have implemented an explicit tasking version:

int waves = N_a + N_b +1;  /*considering N_a == N_b */
#pragma omp parallel firstprivate(a, gap, waves) private(temp, wave, ii, i) shared(np, mp, elements)
    {
#pragma omp master
        {
            for(wave = 0; wave < waves; ++wave) {
                // 0 <= wave < n-1
                if(wave < N_a-1) {
                    elements = wave+1;
                    np = wave+1;
                    mp = 0+1;
                }
                // n-1 <= wave < m
                else if(wave < N_b) {
                    elements = N_a;
                    np = N_a-1+1;
                    mp = wave-(N_a-1)+1;
                }
                // m <= wave < m+n-1
                else {
                    elements = N_a-1-(wave-N_b);
                    np = N_a-1+1;
                    mp = wave-(N_a-1)+1;
                }

                for(ii = 0; ii < elements; ii+=chunk) {
                    min = MIN(elements,ii + chunk);
#pragma omp task firstprivate(ii, np, mp, chunk, elements)
                    {
                        for (i = ii; i < min; i++)
                        {

                        temp[0] = H[(np-i)-1][(mp+i)-1] + similarity(seq_a[a][(np-i)-1],seq_b[a][(mp+i)-1]);
                        temp[1] = H[(np-i)-1][(mp+i)]-gap;
                        temp[2] = H[(np-i)][(mp+i)-1]-gap;
                        temp[3] = 0;
                        H[(np-i)][(mp+i)] = find_array_max(temp,4);
                        }
                    } // task
                } //for loop
#pragma omp taskwait
            }
        }

    }

Strangely, when I execute the code, the performance with 1 thread is much better than with 2, 4, 8 or 16 threads. There is only a single parallel region, and I have strip-mined the inner for loop so that each "chunk" of elements contributes one task.

I insist on a tasking implementation because the value of elements varies constantly from wave to wave, and I feel the code's unstructured parallelism could be handled efficiently by tasks.

I am trying this with the Intel compiler (XE 12). These are the timings I observe for chunk size 256 and N_a = N_b = 4096:

1 thread:   1.237560
2 threads:  7.223232
4 threads:  4.579173
8 threads:  3.663661
16 threads: 4.425525

I am noticing similar behavior with the GCC compiler as well. Can someone please explain why the code with 1 thread does better than with multiple threads? I see similar results for N_a = N_b = 1024, 2048 and 8192 as well.

Thanks.

Can you provide a small, self-contained example to play with? – Massimiliano
You should check here for understanding the performance. – Raj
Does "1 thread" mean running with OMP_NUM_THREADS=1, or compiling without OpenMP enabled? – Hristo Iliev

1 Answer


All of your code is guarded by #pragma omp master, which allows only the master thread to run the calculations. You must remove this pragma (or is there any reason for it?) to see at least some scaling.