
I am studying this tutorial about OpenMP and I came across this exercise on page 19. It is a pi-calculation algorithm that I have to parallelize:

static long num_steps = 100000;
double step;
void main()
{
    int i;
    double x, pi;
    double sum = 0.0;
    step = 1.0 / (double)num_steps;

    for (i = 0; i < num_steps; i++)
    {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }

    pi = step * sum;
}

Up to this point in the tutorial, I cannot use #pragma omp parallel for. I can only use:

#pragma omp parallel {}
omp_get_thread_num();
omp_set_num_threads(int);
omp_get_num_threads();

My implementation looks like this:

#include <stdio.h>
#include <omp.h>

#define NUM_STEPS 800

int main(int argc, char **argv)
{
    int num_steps = NUM_STEPS;
    int i;
    double pi;
    double step = 1.0 / (double)num_steps;

    double sum[num_steps];

    for (i = 0; i < num_steps; i++)
    {
        sum[i] = 0;
    }

    omp_set_num_threads(num_steps);
    #pragma omp parallel
    {
        /* x is declared inside the region so each thread has its own copy */
        double x = (omp_get_thread_num() + 0.5) * step;
        sum[omp_get_thread_num()] += 4.0 / (1.0 + x * x);
    }

    double totalSum = 0;

    for (i = 0; i < num_steps; i++)
    {
        totalSum += sum[i];
    }

    pi = step * totalSum;

    printf("Pi: %.5f\n", pi);
    return 0;
}

Ignoring the problem by using a sum array (it is explained later that a critical section needs to be defined for the sum value with #pragma omp critical or #pragma omp atomic), the above implementation only works for a limited number of threads (800 in my case), whereas the serial code uses 100000 steps. Is there a way to achieve this with only the aforementioned OpenMP constructs, or am I obliged to use #pragma omp parallel for, which hasn't been mentioned yet in the tutorial?

Thanks a lot for your time, I am really trying to grasp the concept of parallelization in C using OpenMP.

Are you allowed to use #pragma omp atomic? – Increasingly Idiotic
@IncreasinglyIdiotic The usefulness of #pragma omp atomic is explained later, but only regarding the sum value. How can I use it to solve the "too many threads" issue? Is there a way, without having to use parallel for? – Vector Sigma
This tutorial is regularly sending confused learners to StackOverflow. I suggest you look for learning material that follows a more idiomatic, high-level approach instead of explaining OpenMP bottom-up. Maybe it works if you are getting a live workshop, but it definitely does not when reading/watching the material online. – Zulan
@Zulan [ad]StackOverflow: turning confusion into knowledge since 2008 :-D[/ad] – Vector Sigma

1 Answer


You will need to find a way to make your parallel algorithm independent of the number of threads.

The simplest way is to do something like:

/* inside a #pragma omp parallel region */
int tid = omp_get_thread_num();
int n_threads = omp_get_num_threads();

/* cyclic (round-robin) distribution of the iterations */
for (int i = tid; i < num_steps; i += n_threads) {
    // ...
}

This way the work is split across all threads, however many there are.

If there were 3 threads and 9 steps:

  • Thread 0 would do steps 0, 3, 6
  • Thread 1 would do steps 1, 4, 7
  • Thread 2 would do steps 2, 5, 8

This works, but it isn't ideal if each thread is accessing data from some shared array: for locality reasons it is better for each thread to access a contiguous section of the data.

In that case you can divide the number of steps by the number of threads and give each thread a contiguous set of tasks like so:

int tid = omp_get_thread_num();
int n_threads = omp_get_num_threads();

int steps_per_thread = num_steps / n_threads;
int start = tid * steps_per_thread;
int end = start + steps_per_thread;

for (int i = start; i < end; i++) {
    // ...
}

Now the 3 threads performing 9 steps look like:

  • Thread 0 does steps 0, 1, 2
  • Thread 1 does steps 3, 4, 5
  • Thread 2 does steps 6, 7, 8

This approach is essentially what most implementations do when #pragma omp for is used: the iterations are divided evenly according to the number of threads, and each thread is assigned one contiguous chunk.

So given a set of 2 threads and a 100 iteration for loop, the compiler would likely give iterations 0-49 to thread 0 and iterations 50-99 to thread 1.

Note that if the number of iterations does not divide evenly by the number of threads the remainder needs to be handled explicitly.