3
votes

I need to compute the interactions between all elements i,j in a vector of objects. In a vector of size N, this amounts to (N*(N-1))/2 computations, and would naively be solved in a nested for loop like this:

for( unsigned int i = 0; i < vector.size()-1; i++ ) {
  for ( unsigned int j = i+1; j < vector.size(); j++ ) {
    //compute interaction between vector[i] and vector[j]
  }
}

The difficulty comes in trying to speed up the process with OpenMP parallelization. The number of computations in the inner loop decreases linearly as i increases. As I understand it, #pragma omp parallel for will divide a loop evenly by the number of threads used. Although the outer loop would be evenly divided, the actual computation would not. For example, a vector of length 257 would have (257*256)/2 = 32896 computations. If OpenMP split the outer loop evenly (thread 1 has i=0...127, thread 2 has i=128...255), thread 1 would have to compute 24640 interactions while thread 2 would have to compute only 8256, so the parallel run would take ~75% as long as the serial one, for a total efficiency of ~67%. Splitting the outer loop between 4 threads would take ~44% as long at roughly 57% efficiency. I can verify this is a problem with an MCVE:

#include <iostream>
#include <unistd.h>
#include <omp.h>
#include <vector>
#include <ctime>

int main()
{
  timespec sleepTime;
  sleepTime.tv_sec = 0;
  sleepTime.tv_nsec = 1000000L; // 1 ms
  std::vector< int > dummyVector(257,0);
  #pragma omp parallel for
  for(unsigned int i = 0; i < dummyVector.size()-1; i++ ) {
    for(unsigned int j = i+1; j < dummyVector.size(); j++ ) {
      // calculate( dummyVector[i], dummyVector[j] );
      nanosleep(&sleepTime,NULL);
    }
  }
  return 0;
}

Using nanosleep to simulate my interactions, the 2-thread and 4-thread versions take 75% and 44% as long, respectively:

[me@localhost build]$ export OMP_NUM_THREADS=1
[me@localhost build]$ time ./Temp

real    0m38.242s ...
[me@localhost build]$ export OMP_NUM_THREADS=2
[me@localhost build]$ time ./Temp

real    0m28.576s ...
[me@localhost build]$ export OMP_NUM_THREADS=4
[me@localhost build]$ time ./Temp

real    0m16.715s ... 

How can I better balance the computation across threads? Is there a way to tell OpenMP to split the outer loop discontinuously?


In an effort to move the nested for loop out of the omp parallel block, I tried precomputing all possible pairs of indices and then looping over those pairs:

  std::vector< std::pair < int, int > > allPairs;
  allPairs.reserve((dummyVector.size()*(dummyVector.size()-1))/2);
  for(unsigned int i = 0; i < dummyVector.size()-1; i++ ) {
    for(unsigned int j = i+1; j < dummyVector.size(); j++ ) {
      allPairs.emplace_back(i, j);
    }
  }

  #pragma omp parallel for
  for( unsigned int i = 0; i < allPairs.size(); i++ ) {
    // calculate( dummyVector[allPairs[i].first], 
    //   dummyVector[allPairs[i].second] ); 
    nanosleep(&sleepTime,NULL);
  }

This does efficiently balance the computation across threads, but it introduces the unavoidably serial construction of the index pairs, which will hurt my runtime as N grows. Can I do better than this?

2
If you can come up with an equation that can determine the pair given an N, you won't have to precompute all the pairs. – IanPudney
Yes, there is a way to tell OpenMP to split the outer loop into different-size chunks. Investigate the use of the schedule clause, which exists for exactly the type of problem you have: load imbalance when using the (usually) default static schedule. – High Performance Mark
@HighPerformanceMark I knew I couldn't be the first to run into this problem. Both schedule(dynamic) and schedule(guided) (with a reordered outer loop to do the shorter computations first) significantly improved the load balancing. If you'd like to expound on your comment in an answer, I can accept it. Otherwise I will write an answer later explaining the details. – user3288829

2 Answers

3
votes

As suggested by @HighPerformanceMark, the solution lies in the scheduling of the OpenMP parallel for loops. The Lawrence Livermore OpenMP tutorial has a pretty good description of the different options, but the general syntax is #pragma omp parallel for schedule(type[,chunk]), where the chunk parameter is optional. If you don't specify a schedule, the default is implementation-specific. For libgomp, the default is STATIC, which divides the loop iterations evenly and contiguously, leading to the poor load balancing in this problem.

Two other scheduling options fix the load balancing problem at the cost of slightly higher overhead. The first is DYNAMIC, which assigns a chunk (the default chunk size is 1 loop iteration) to each thread as it finishes its previous work. Thus the code looks like this:

#pragma omp parallel for schedule( dynamic )
  for(unsigned int i = 0; i < dummyVector.size()-1; i++ ) {
    for(unsigned int j = i+1; j < dummyVector.size(); j++ ) {
      // calculate( dummyVector[i], dummyVector[j]);
    }
  }

Because the computational cost of the inner loop is structured (decreasing linearly with increasing i), the GUIDED schedule also works well. It also dynamically assigns blocks of work to each thread, but it starts with larger blocks and decreases block size as computation continues. The first block of iterations assigned to a thread is of size number_iterations/number_threads, and each subsequent block is of size remaining_iterations/number_threads. This does require reversing the order of the outer loop, however, so that the initial iterations contain the least amount of work.

#pragma omp parallel for schedule( guided )
  for( int i = (int)dummyVector.size()-2; i >= 0; i-- ) {
    // i counts down so the early (large) guided chunks get the short rows;
    // i is signed so the loop can reach 0 in canonical OpenMP form
    for( unsigned int j = i+1; j < dummyVector.size(); j++ ) {
      // calculate( dummyVector[i], dummyVector[j] ); 
    }
  }
1
votes

I recommend the other answer (the one with the schedule clauses for omp parallel for).

There is an alternative: compute a linear index u, then extract i and j from it, as follows.

If you had to compute both i->j and j->i interactions:

#pragma omp parallel for
for(size_t u=0; u<vector.size()*vector.size(); ++u) {
   size_t i = u/vector.size();
   size_t j = u%vector.size();
   if(i != j) {
      // compute interaction between vector[i] and vector[j]
   }
}

If interactions are symmetric (like in your case):

There is a similar formula, but it is more difficult. You need to generate the following sequence of (i,j) couples:

(1,0) (2,0) (2,1) (3,0) (3,1) (3,2) (4,0) (4,1) (4,2) (4,3) ...

For index i, the associated sequence of couples (i,j) is of length i; therefore, the formula to translate a couple (i,j) into a linear index u is:

u = i(i-1)/2 + j

Now one needs to 'invert' this formula and retrieve the integers i and j.

One retrieves the value of i when j=0:

i^2 - i - 2*u = 0

Solving the quadratic equation (truncating the square root, which also yields the correct i when j > 0, since u then stays below i(i+1)/2) gives:

i = (1 + (int)(sqrt(1 + 8*u))) / 2

And one deduces the value of j:

j = u - i*(i-1)/2

Yes, this is a pretty convoluted way of doing it; I definitely prefer the other proposed solution, which is less error-prone!

edit1: declared u, i and j as size_t (instead of int) to avoid overflow (overflow can still occur if vector.size() is greater than 2^32, but this leaves reasonable room). Thanks to EOF for the remark.

edit2: if interactions are symmetric, it is more subtle than I thought (my initial proposal was equivalent to the load-imbalanced nested loops, see comments).

edit3: inverted the formula for the symmetric interaction case.