I am trying to make a fast parallel loop. In each iteration of the loop, I build an array which is costly so I want it distributed over many threads. After the array is built, I use it to update a matrix. Here it gets tricky because the matrix is common to all threads so only 1 thread can modify parts of the matrix at one time, but when I work on the matrix, it turns out I can distribute that work too since I can work on different parts of the matrix at the same time.
Here is what I currently am doing:
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
... build array bi ...
#pragma omp critical
{
update_matrix(A, bi)
}
}
...
subroutine update_matrix(A, b)
{
printf("id0 = %d\n", omp_get_thread_num());
#pragma omp parallel sections
{
#pragma omp section
{
printf("id1 = %d\n", omp_get_thread_num());
modify columns 1 to j of A using b
}
#pragma omp section
{
printf("id2 = %d\n", omp_get_thread_num());
modify columns j+1 to k of A using b
}
}
}
The problem is that the two different sections of the update_matrix() routine are not being parallelized. The output I get looks like this:
id0 = 19
id1 = 0
id2 = 0
id0 = 5
id1 = 0
id2 = 0
...
So the two sections are being executed by the same thread (0). I tried removing the #pragma omp critical in the main loop but it gives the same result. Does anyone know what I'm doing wrong?
OMP_NESTED
to be set to'true'
, or with the functionomp_set_nested()
inside the code – Gilles