3
votes

I need to compute the interactions between all elements i,j in a vector of objects. In a vector of size N, this amounts to (N*(N-1))/2 computations, and would naively be solved in a nested for loop like this:

for( unsigned int i = 0; i < vector.size()-1; i++ ) {
  for ( unsigned int j = i+1; j < vector.size(); j++ ) {
    //compute interaction between vector[i] and vector[j]
  }
}

The difficulty comes in trying to speed up the process with OpenMP parallelization. The number of computations in the inner loop decreases linearly as i increases. As I understand it, #pragma omp parallel for will divide a loop evenly by the number of threads used. Although the outer loop would be evenly divided, the actual computation would not. For example, a vector of length 257 would have (257*256)/2 = 32896 computations. If OpenMP split the outer loop evenly (thread 1 has i=0...127, thread 2 has i=128...255), thread 1 would have to compute 24640 interactions while thread 2 would have to compute only 8256, so the parallel run would take ~75% as long as the serial one, for a total efficiency of ~67%. Splitting the outer loop between 4 threads would take ~44% as long at roughly 57% efficiency. I can verify this is a problem with an MCVE:

#include <iostream>
#include <unistd.h>
#include <omp.h>
#include <vector>
#include <ctime>

int main()
{
  timespec sleepTime;
  sleepTime.tv_sec = 0;
  sleepTime.tv_nsec = 1000000L; // 1 ms
  std::vector< int > dummyVector(257,0);
  #pragma omp parallel for
  for(unsigned int i = 0; i < dummyVector.size()-1; i++ ) {
    for(unsigned int j = i+1; j < dummyVector.size(); j++ ) {
      // calculate( dummyVector[i], dummyVector[j] );
      nanosleep(&sleepTime,NULL);
    }
  }
  return 0;
}

Using nanosleep to simulate my interactions, the 2-thread and 4-thread versions take 75% and 44% as long, respectively:

[me@localhost build]$ export OMP_NUM_THREADS=1
[me@localhost build]$ time ./Temp

real    0m38.242s ...
[me@localhost build]$ export OMP_NUM_THREADS=2
[me@localhost build]$ time ./Temp

real    0m28.576s ...
[me@localhost build]$ export OMP_NUM_THREADS=4
[me@localhost build]$ time ./Temp

real    0m16.715s ... 

How can I better balance the computation across threads? Is there a way to tell OpenMP to split the outer loop discontinuously?


In an effort to move the nested for loop out of the omp parallel block, I tried precomputing all possible pairs of indices and then looping over those pairs:

  std::vector< std::pair < int, int > > allPairs;
  allPairs.reserve((dummyVector.size()*(dummyVector.size()-1))/2);
  for(unsigned int i = 0; i < dummyVector.size()-1; i++ ) {
    for(unsigned int j = i+1; j < dummyVector.size(); j++ ) {
      allPairs.emplace_back(i, j);
    }
  }

  #pragma omp parallel for
  for( unsigned int i = 0; i < allPairs.size(); i++ ) {
    // calculate( dummyVector[allPairs[i].first], 
    //   dummyVector[allPairs[i].second] ); 
    nanosleep(&sleepTime,NULL);
  }

This does efficiently balance the computation across threads, but it introduces the unavoidably serial construction of the index pairs, which will hurt my runtime as N grows. Can I do better than this?

2
If you can come up with an equation that can determine the pair given an N, you won't have to precompute all the pairs. – IanPudney
Yes, there is a way to tell OpenMP to split the outer loop into different-size chunks. Investigate the use of the schedule clause, which exists for exactly the type of problem you have: load imbalance when using the (usually) default static schedule. – High Performance Mark
@HighPerformanceMark I knew I couldn't be the first to run into this problem. Both schedule(dynamic) and schedule(guided) (with a reordered outer loop to do the shorter computations first) significantly improved the load balancing. If you'd like to expound on your comment in an answer, I can accept it. Otherwise I will write an answer later explaining the details. – user3288829

2 Answers

3
votes

As suggested by @HighPerformanceMark, the solution lies in the scheduling of the OpenMP parallel for loops. The Lawrence Livermore OpenMP tutorial has a pretty good description of the different options, but the general syntax is #pragma omp parallel for schedule(type[,chunk]), where the chunk parameter is optional. If you don't specify a schedule, the default is implementation-specific. For libgomp, the default is STATIC, which divides the loop iterations evenly and contiguously, leading to the poor load balancing in this problem.

Two other scheduling options fix the load balancing problem at the cost of slightly higher overhead. The first is DYNAMIC, which assigns a chunk (the default chunk size is 1 loop iteration) to each thread as it finishes its previous work. Thus the code looks like this:

#pragma omp parallel for schedule( dynamic )
  for(unsigned int i = 0; i < dummyVector.size()-1; i++ ) {
    for(unsigned int j = i+1; j < dummyVector.size(); j++ ) {
      // calculate( dummyVector[i], dummyVector[j]);
    }
  }

Because the computational cost of the inner loop is structured (decreasing linearly with increasing i), the GUIDED schedule also works well. It also dynamically assigns blocks of work to each thread, but it starts with larger blocks and decreases block size as computation continues. The first block of iterations assigned to a thread is of size number_iterations/number_threads, and each subsequent block is of size remaining_iterations/number_threads. This does require reversing the order of the outer loop, however, so that the initial iterations contain the least amount of work.

#pragma omp parallel for schedule( guided )
  for( int i = (int)dummyVector.size()-2; i >= 0; i-- ) {
    // i counts down so the early (large) guided chunks get the short rows;
    // i is signed so the loop can reach 0 in canonical OpenMP form
    for( unsigned int j = i+1; j < dummyVector.size(); j++ ) {
      // calculate( dummyVector[i], dummyVector[j] ); 
    }
  }
1
votes

I recommend the other answer (the one with the schedule clauses for omp parallel for).

There is an alternative: compute a linear index u, then extract i and j from it, as follows.

If you had to compute both i->j and j->i interactions:

#pragma omp parallel for
for(size_t u=0; u<vector.size()*vector.size(); ++u) {
   size_t i = u/vector.size();
   size_t j = u%vector.size();
   if(i != j) {
      // compute interaction between vector[i] and vector[j]
   }
}

If interactions are symmetric (like in your case):

There is a similar formula, but it is more difficult. You need to generate the following sequence of (i,j) couples:

(1,0) (2,0) (2,1) (3,0) (3,1) (3,2) (4,0) (4,1) (4,2) (4,3) ...

For index i, the associated sequence of couples (i,j) is of length i; therefore, the formula to translate a couple (i,j) into a linear index u is:

u = i(i-1)/2 + j

Now one needs to 'invert' this formula and retrieve the integers i and j.

One retrieves the value of i when j=0:

i^2 - i - 2*u = 0

Solving the quadratic equation (truncating the square root, which also yields the correct i when j > 0, since u then stays below i(i+1)/2) gives:

i = (1 + (int)(sqrt(1 + 8*u))) / 2

And one deduces the value of j:

j = u - i*(i-1)/2

Yes, this is a pretty convoluted way of doing it; I definitely prefer the other proposed solution, which is less error-prone!

edit1: declared u, i and j as size_t (instead of int) to avoid overflow (overflow can still occur if vector.size() is greater than 2^32, but this leaves reasonable room). Thanks to EOF for the remark.

edit2: if interactions are symmetric, it is more subtle than I thought (my initial proposal was equivalent to the load-imbalanced nested loops, see comments).

edit3: inverted the formula for the symmetric interaction case.