I need to compute the interactions between all pairs of elements i, j in a vector of objects. For a vector of size N, this amounts to N*(N-1)/2 computations, which would naively be solved with a nested for loop like this:
for ( unsigned int i = 0; i < vector.size() - 1; i++ ) {
    for ( unsigned int j = i + 1; j < vector.size(); j++ ) {
        // compute the interaction between vector[i] and vector[j]
    }
}
The difficulty comes in trying to speed up the process with OpenMP parallelization. The number of computations in the inner loop decreases linearly as i increases. As I understand it, #pragma omp parallel for will, by default, divide the loop iterations evenly among the threads. Although the outer loop would be evenly divided, the actual computation would not. For example, a vector of length 257 requires 257*256/2 = 32896 computations. If OpenMP split the outer loop evenly (thread 1 gets i = 0...127, thread 2 gets i = 128...255), thread 1 would have to compute 24640 interactions while thread 2 would compute only 8256, so the run would take ~75% as long as the serial version, for a parallel efficiency of ~67%. Splitting the outer loop between 4 threads would take ~44% as long, at roughly 57% efficiency. I can verify this is a problem with an MCVE:
#include <unistd.h>
#include <omp.h>
#include <vector>
#include <ctime>

int main()
{
    timespec sleepTime;
    sleepTime.tv_sec = 0;
    sleepTime.tv_nsec = 1000000; // 1 ms
    std::vector< int > dummyVector( 257, 0 );

    #pragma omp parallel for
    for ( unsigned int i = 0; i < dummyVector.size() - 1; i++ ) {
        for ( unsigned int j = i + 1; j < dummyVector.size(); j++ ) {
            // calculate( dummyVector[i], dummyVector[j] );
            nanosleep( &sleepTime, NULL );
        }
    }
    return 0;
}
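(Built with, e.g., g++ -fopenmp Temp.cpp -o Temp; the compiler and flags here are just what I happen to use, and any OpenMP-capable compiler should behave similarly.)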
Using nanosleep to simulate my interactions, the 2-thread and 4-thread versions take ~75% and ~44% as long, respectively:
[me@localhost build]$ export OMP_NUM_THREADS=1
[me@localhost build]$ time ./Temp
real 0m38.242s ...
[me@localhost build]$ export OMP_NUM_THREADS=2
[me@localhost build]$ time ./Temp
real 0m28.576s ...
[me@localhost build]$ export OMP_NUM_THREADS=4
[me@localhost build]$ time ./Temp
real 0m16.715s ...
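The measured fractions, 28.576/38.242 ≈ 75% and 16.715/38.242 ≈ 44%, match the prediction above. For completeness, a quick throwaway sketch that reproduces the predicted per-thread counts; it is just arithmetic, not part of the real code:

#include <algorithm>
#include <iostream>

int main()
{
    const long n = 257;      // vector length from the example
    const long threads = 2;  // try 2 and 4
    const long chunk = ( n - 1 ) / threads; // outer iterations (i = 0..255) per thread

    long total = 0, worst = 0;
    for ( long t = 0; t < threads; t++ ) {
        long work = 0;
        for ( long i = t * chunk; i < ( t + 1 ) * chunk; i++ )
            work += ( n - 1 ) - i; // inner loop length for this i
        std::cout << "thread " << t << ": " << work << " interactions\n";
        total += work;
        worst = std::max( worst, work );
    }
    // the longest thread's share approximates the fraction of the
    // serial runtime that the parallel run needs
    std::cout << "longest thread share: " << 100.0 * worst / total << "%\n";
    return 0;
}

With threads = 2 this prints 24640 and 8256 interactions (share ~75%); with threads = 4 the largest share is 14368 (~44%).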
How can I better balance the computation across threads? Is there a way to tell OpenMP to split the outer loop discontinuously?
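To make "discontinuously" concrete: a hand-rolled round-robin split over the MCVE's loop would look like the sketch below, interleaving the expensive low-i rows with the cheap high-i rows, but I'd rather not manage thread indices by hand:

#pragma omp parallel
{
    // each thread takes every numThreads-th outer iteration, so long
    // and short rows are interleaved across threads
    unsigned int numThreads = omp_get_num_threads();
    unsigned int tid = omp_get_thread_num();
    for ( unsigned int i = tid; i < dummyVector.size() - 1; i += numThreads ) {
        for ( unsigned int j = i + 1; j < dummyVector.size(); j++ ) {
            nanosleep( &sleepTime, NULL );
        }
    }
}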
In an effort to replace the triangular nested loop with a single, evenly divisible loop, I tried precomputing all possible pairs of indices, then looping over those pairs:
std::vector< std::pair< int, int > > allPairs;
allPairs.reserve( ( dummyVector.size() * ( dummyVector.size() - 1 ) ) / 2 );
for ( unsigned int i = 0; i < dummyVector.size() - 1; i++ ) {
    for ( unsigned int j = i + 1; j < dummyVector.size(); j++ ) {
        allPairs.push_back( std::make_pair( i, j ) );
    }
}
#pragma omp parallel for
for ( unsigned int i = 0; i < allPairs.size(); i++ ) {
    // calculate( dummyVector[allPairs[i].first],
    //            dummyVector[allPairs[i].second] );
    nanosleep( &sleepTime, NULL );
}
This does balance the computation evenly across threads, but it introduces the (as far as I can see) unavoidably serial construction of the index pairs, plus the memory to hold them, which will hurt my runtime as N grows. Can I do better than this?
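One direction I have considered, sketched below: skip the pair vector entirely and recover (i, j) from the flat index k by inverting the triangular row offsets f(i) = i*(2N - i - 1)/2. The derivation and the rounding guard are my own, so treat this as a sketch rather than tested code; since every flat iteration is exactly one interaction, an even static split of k is automatically balanced, and the O(N^2) memory of allPairs disappears.

#include <cmath>
#include <cstddef>

// Map a flat index k (0 <= k < N*(N-1)/2) back to the pair (i, j), i < j.
void unflatten( std::size_t k, std::size_t N, std::size_t &i, std::size_t &j )
{
    double b = 2.0 * N - 1.0;
    std::size_t row = static_cast< std::size_t >( ( b - std::sqrt( b * b - 8.0 * k ) ) / 2.0 );
    // guard against off-by-one from floating-point rounding
    while ( row * ( 2 * N - row - 1 ) / 2 > k ) row--;
    while ( ( row + 1 ) * ( 2 * N - row - 2 ) / 2 <= k ) row++;
    i = row;
    j = k - row * ( 2 * N - row - 1 ) / 2 + row + 1;
}

// usage: one flat, evenly divisible loop with no serial setup
// std::size_t total = dummyVector.size() * ( dummyVector.size() - 1 ) / 2;
// #pragma omp parallel for
// for ( std::size_t k = 0; k < total; k++ ) {
//     std::size_t i, j;
//     unflatten( k, dummyVector.size(), i, j );
//     // calculate( dummyVector[i], dummyVector[j] );
// }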
Look at the schedule clause, which exists for exactly the type of problem you have: load imbalance when using the (usually) default static schedule. – High Performance Mark

schedule(dynamic) and schedule(guided) (with a reordered outer loop to do the shorter computations first) significantly improved the load balancing. If you'd like to expound on your comment in an answer, I can accept it. Otherwise I will write an answer later explaining the details. – user3288829
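To make the comments concrete, here is a minimal sketch of both suggestions applied to the MCVE above. The schedule clauses are standard OpenMP; leaving the chunk sizes at their defaults is just my illustration, not a recommendation.

// Option 1: dynamic scheduling - threads grab outer iterations as they
// finish their previous ones, so long and short rows even out.
#pragma omp parallel for schedule(dynamic)
for ( unsigned int i = 0; i < dummyVector.size() - 1; i++ ) {
    for ( unsigned int j = i + 1; j < dummyVector.size(); j++ ) {
        nanosleep( &sleepTime, NULL );
    }
}

// Option 2: guided scheduling with the outer loop reordered so the
// short rows (large i) come first. Guided hands out large chunks early
// and progressively smaller ones, which suits work that grows as the
// loop proceeds.
#pragma omp parallel for schedule(guided)
for ( unsigned int r = 0; r < dummyVector.size() - 1; r++ ) {
    unsigned int i = dummyVector.size() - 2 - r; // shortest rows first
    for ( unsigned int j = i + 1; j < dummyVector.size(); j++ ) {
        nanosleep( &sleepTime, NULL );
    }
}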