0 votes

I am trying to make a fast parallel loop. In each iteration of the loop, I build an array, which is costly, so I want that work distributed over many threads. After the array is built, I use it to update a matrix. Here it gets tricky, because the matrix is common to all threads, so only one thread can modify a given part of the matrix at a time; but when I work on the matrix, it turns out I can distribute that work too, since I can work on different parts of the matrix at the same time.

Here is what I currently am doing:

#pragma omp parallel for
for (i = 0; i < n; ++i)
{
  ... build array bi ...
  #pragma omp critical
  {
    update_matrix(A, bi);
  }
}

...

subroutine update_matrix(A, b)
{
  printf("id0 = %d\n", omp_get_thread_num());
  #pragma omp parallel sections
  {
    #pragma omp section
    {
      printf("id1 = %d\n", omp_get_thread_num());
      modify columns 1 to j of A using b
    }

    #pragma omp section
    {
      printf("id2 = %d\n", omp_get_thread_num());
      modify columns j+1 to k of A using b
    }
  }
}

The problem is that the two different sections of the update_matrix() routine are not being parallelized. The output I get looks like this:

id0 = 19
id1 = 0
id2 = 0
id0 = 5
id1 = 0
id2 = 0
...

So the two sections are being executed by the same thread (0). I tried removing the #pragma omp critical in the main loop but it gives the same result. Does anyone know what I'm doing wrong?

I'm not sure it'll be of any use in terms of performance, but if you want to do nested parallelism (which is what you're trying to achieve here), you need to enable it explicitly. That can be done by setting the environment variable OMP_NESTED to 'true', or by calling omp_set_nested() inside the code. – Gilles

1 Answer

1 vote

#pragma omp parallel sections does not work there because you are already inside a parallel region created by the #pragma omp parallel for directive. Unless you have enabled nested parallelism with omp_set_nested(1);, the parallel sections directive is effectively ignored: the inner region is executed by a team of a single thread, and inside that inner team omp_get_thread_num() returns 0, which is exactly what your output shows.
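For reference, here is a minimal, self-contained sketch of what enabling nesting looks like (thread counts and printouts are only for illustration; note that omp_set_nested() is deprecated in OpenMP 5.0 in favor of omp_set_max_active_levels()):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_nested(1);                /* enable nested parallelism                 */
    /* omp_set_max_active_levels(2);     alternative for OpenMP 5.0 and newer      */

    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 4; ++i)
    {
        int outer = omp_get_thread_num();   /* thread number in the outer team     */
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            printf("outer %d, inner %d\n", outer, omp_get_thread_num());
            #pragma omp section
            printf("outer %d, inner %d\n", outer, omp_get_thread_num());
        }
    }
    return 0;
}

With nesting enabled, the inner thread numbers differ (0 and 1); without it, both sections report inner thread 0, as in your output.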

Please note that this is not necessarily efficient: spawning new threads has an overhead cost which may not be worth it if the update_matrix part is not very CPU intensive.

You have several options:

  • Forget about it. If the non-critical part of the loop is really where most of the computation happens and you already have as many threads as CPUs, spawning extra threads for a simple operation will do no good. Just remove the parallel sections directive from the subroutine.

  • Try enabling nesting with omp_set_nested(1); (or by setting the environment variable OMP_NESTED to true), as sketched above.

  • Use named critical sections, which comes at the cost of a double synchronization overhead. At any time there may be at most one thread in critical section ONE_TO_J and one in critical section J_TO_K, so up to two threads may update the matrix in parallel. This is costly in terms of synchronization overhead.

    #pragma omp parallel for
    for (i = 0; i < n; ++i)
    {
      ... build array bi ...
      update_matrix(A, bi); // not critical
    }
    
    ...
    
    subroutine update_matrix(A, b)
    {
      printf("id0 = %d\n", omp_get_thread_num());
      #pragma omp critical(ONE_TO_J)
      {
        printf("id1 = %d\n", omp_get_thread_num());
        modify columns 1 to j of A using b
      }

      #pragma omp critical(J_TO_K)
      {
        printf("id2 = %d\n", omp_get_thread_num());
        modify columns j+1 to k of A using b
      }
    }
    
  • Or use atomic operations to edit the matrix, if this is suitable.

    #pragma omp parallel for
    for (i = 0; i < n; ++i)
    {
      ... build array bi ...
      update_matrix(A, bi); // not critical
    }
    
    ...
    
    subroutine update_matrix(A, b)
    {
        float tmp;
        printf("id0 = %d\n", omp_get_thread_num());
        for (int row = 0; row < max_row; row++)
            for (int column = 0; column < k; column++)
            {
                tmp = some_function(b, row, column);
                #pragma omp atomic
                A[column][row] += tmp;
            }
    }
    

    By the way, data is stored in row-major order in C, so you should be updating the matrix row by row rather than column by column. This prevents false sharing and improves the memory-access performance of the algorithm; see the sketch below.
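    For instance (just a sketch, assuming A is declared as float A[max_row][k] so the row index comes first, and reusing the hypothetical some_function() from above), letting the column index vary fastest keeps each thread's writes contiguous in memory:

    subroutine update_matrix(A, b)
    {
        printf("id0 = %d\n", omp_get_thread_num());
        for (int row = 0; row < max_row; row++)            /* rows in the outer loop      */
            for (int column = 0; column < k; column++)     /* consecutive memory accesses */
            {
                float tmp = some_function(b, row, column);
                #pragma omp atomic
                A[row][column] += tmp;                      /* row-major friendly indexing */
            }
    }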