I have the following code, for which I have implemented an explicit tasking version:
int waves = N_a + N_b + 1; /* considering N_a == N_b */
#pragma omp parallel firstprivate(a, gap, waves) private(temp, wave, ii, i, min) shared(np, mp, elements)
{
    #pragma omp master
    {
        for (wave = 0; wave < waves; ++wave) {
            // growing phase: 0 <= wave < N_a-1
            if (wave < N_a-1) {
                elements = wave+1;
                np = wave+1;
                mp = 0+1;
            }
            // steady phase: N_a-1 <= wave < N_b
            else if (wave < N_b) {
                elements = N_a;
                np = N_a-1+1;
                mp = wave-(N_a-1)+1;
            }
            // shrinking phase: N_b <= wave < N_a+N_b-1
            else {
                elements = N_a-1-(wave-N_b);
                np = N_a-1+1;
                mp = wave-(N_a-1)+1;
            }
            // strip-mine the anti-diagonal into tasks of "chunk" cells each
            for (ii = 0; ii < elements; ii += chunk) {
                min = MIN(elements, ii + chunk);
                // "min" must be captured as well, otherwise tasks race on a shared copy
                #pragma omp task firstprivate(ii, np, mp, chunk, elements, min)
                {
                    for (i = ii; i < min; i++)
                    {
                        temp[0] = H[(np-i)-1][(mp+i)-1] + similarity(seq_a[a][(np-i)-1], seq_b[a][(mp+i)-1]);
                        temp[1] = H[(np-i)-1][(mp+i)] - gap;
                        temp[2] = H[(np-i)][(mp+i)-1] - gap;
                        temp[3] = 0;
                        H[(np-i)][(mp+i)] = find_array_max(temp, 4);
                    }
                } // task
            } // strip-mining loop
            #pragma omp taskwait
        } // wave loop
    }
}
Strangely, on executing the code, the performance with 1 thread is much better than with 2, 4, 8, or 16 threads. There is only a single parallel region, and I have strip-mined the inner for loop so that every "chunk" elements contribute one task.
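For clarity, here is a minimal, self-contained sketch of that strip-mining pattern in isolation (do_chunk, cells, and the sizes are placeholders, not my real code). Computing the chunk's upper bound inside the task is an easy way to make sure nothing is shared by accident:

#include <omp.h>

#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* stand-in for the real anti-diagonal update */
static void do_chunk(double *cells, int lo, int hi) {
    for (int i = lo; i < hi; i++)
        cells[i] += 1.0;
}

int main(void) {
    enum { ELEMENTS = 4096, CHUNK = 256 };
    static double cells[ELEMENTS];

    #pragma omp parallel
    #pragma omp master
    {
        for (int ii = 0; ii < ELEMENTS; ii += CHUNK) {
            #pragma omp task firstprivate(ii)
            {
                int hi = MIN(ELEMENTS, ii + CHUNK); /* private to each task */
                do_chunk(cells, ii, hi);
            }
        }
        #pragma omp taskwait
    }
    return 0;
}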
I insist on a task-based implementation, because the value of elements varies from wave to wave, and I believe this kind of unstructured parallelism can be handled efficiently with tasks.
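As an aside, on a compiler newer than the ones I am testing, OpenMP 4.5's taskloop construct expresses the same idea directly, with grainsize playing the role of my chunk. This is only a sketch (process_wave and row are made-up names), not something the compilers above support:

#include <omp.h>

/* sketch: to be called from inside a parallel/master region,
   like the per-wave loop above */
static void process_wave(double *row, int elements) {
    #pragma omp taskloop grainsize(256)
    for (int i = 0; i < elements; i++)
        row[i] += 1.0; /* stand-in for the real cell update */
}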
I am trying this with the Intel compiler (Composer XE 12). Here are the timings I observe for chunk size 256 and N_a = N_b = 4096:
1 thread:   1.237560
2 threads:  7.223232
4 threads:  4.579173
8 threads:  3.663661
16 threads: 4.425525
I am noticing similar behavior with the GCC compiler as well. Can someone please explain why the code with 1 thread does better than with multiple threads? I see similar results for N_a = N_b = 1024, 2048, and 8192 as well.
Thanks.
Are you sure you are not simply setting OMP_NUM_THREADS=1 or compiling without OpenMP being enabled? – Hristo Iliev
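A minimal check to rule both of those out is to print the team size from inside a parallel region (nothing from the original code is needed):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    #pragma omp master
    printf("running with %d threads\n", omp_get_num_threads());
    return 0;
}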