I have a function which is apparently the bottleneck of my whole program, and I thought that parallelizing it with OpenMP could help.
Here is a working example of my computation (sorry, the function is a little long). In my program, some of the work before the 5 nested loops is done elsewhere and is not a performance concern at all.
#include <vector>
#include <iostream>
#include <cmath>
#include <cstdio>
#include <cstdlib> // std::rand, RAND_MAX
#include <chrono>
#include "boost/dynamic_bitset.hpp"
using namespace std::chrono;
void compute_mddr(unsigned int Ns, unsigned int block_size, unsigned int sector)
{
  // Build the full basis: all Ns-bit states with exactly Ns/2 bits set
  std::vector<unsigned int> basis;
  for (std::size_t s = 0; s != std::pow(2,Ns); s++) {
    boost::dynamic_bitset<> s_bin(Ns,s);
    if (s_bin.count() == Ns/2) {
      basis.push_back(s);
    }
  }

  // Random coefficients standing in for the ground state
  std::vector<double> gs(basis.size());
  for (unsigned int i = 0; i != gs.size(); i++)
    gs[i] = double(std::rand())/RAND_MAX;

  // Bit masks selecting the A block (low ns_A bits) and the B block (the rest)
  unsigned int ns_A = block_size;
  unsigned int ns_B = Ns-ns_A;
  boost::dynamic_bitset<> mask_A(Ns,(1<<ns_A)-(1<<0));
  boost::dynamic_bitset<> mask_B(Ns,((1<<ns_B)-(1<<0))<<ns_A);

  // Find the basis of the A block
  unsigned int NAsec = sector;
  std::vector<unsigned int> basis_NAsec;
  for (unsigned int s = 0; s < std::pow(2,ns_A); s++) {
    boost::dynamic_bitset<> s_bin(ns_A,s);
    if (s_bin.count() == NAsec)
      basis_NAsec.push_back(s);
  }
  unsigned int bs_A = basis_NAsec.size();

  // Find the basis of the B block
  unsigned int NBsec = (Ns/2)-sector;
  std::vector<unsigned int> basis_NBsec;
  for (unsigned int s = 0; s < std::pow(2,ns_B); s++) {
    boost::dynamic_bitset<> s_bin(ns_B,s);
    if (s_bin.count() == NBsec)
      basis_NBsec.push_back(s);
  }
  unsigned int bs_B = basis_NBsec.size();

  std::vector<std::vector<double> > mddr(bs_A);
  for (unsigned int i = 0; i != mddr.size(); i++) {
    mddr[i].resize(bs_A);
    for (unsigned int j = 0; j != mddr[i].size(); j++) {
      mddr[i][j] = 0.0;
    }
  }

  // Main calculation part
  for (unsigned int mu_A = 0; mu_A != bs_A; mu_A++) { // loop 1
    boost::dynamic_bitset<> mu_A_bin(ns_A,basis_NAsec[mu_A]);
    for (unsigned int nu_A = mu_A; nu_A != bs_A; nu_A++) { // loop 2
      boost::dynamic_bitset<> nu_A_bin(ns_A,basis_NAsec[nu_A]);
      double sum = 0.0;
      #pragma omp parallel for reduction(+:sum)
      for (unsigned int mu_B = 0; mu_B < bs_B; mu_B++) { // loop 3
        boost::dynamic_bitset<> mu_B_bin(ns_B,basis_NBsec[mu_B]);
        for (unsigned int si = 0; si != basis.size(); si++) { // loop 4
          boost::dynamic_bitset<> si_bin(Ns,basis[si]);
          // keep si only if its A part matches mu_A and its B part matches mu_B
          boost::dynamic_bitset<> si_A_bin = si_bin & mask_A;
          si_A_bin.resize(ns_A);
          if (si_A_bin != mu_A_bin)
            continue;
          boost::dynamic_bitset<> si_B_bin = (si_bin & mask_B)>>ns_A;
          si_B_bin.resize(ns_B);
          if (si_B_bin != mu_B_bin)
            continue;
          for (unsigned int sj = 0; sj < basis.size(); sj++) { // loop 5
            boost::dynamic_bitset<> sj_bin(Ns,basis[sj]);
            // keep sj only if its A part matches nu_A and its B part matches mu_B
            boost::dynamic_bitset<> sj_A_bin = sj_bin & mask_A;
            sj_A_bin.resize(ns_A);
            if (sj_A_bin != nu_A_bin)
              continue;
            boost::dynamic_bitset<> sj_B_bin = (sj_bin & mask_B)>>ns_A;
            sj_B_bin.resize(ns_B);
            if (sj_B_bin != mu_B_bin)
              continue;
            sum += gs[si]*gs[sj];
          }
        }
      }
      mddr[nu_A][mu_A] = mddr[mu_A][nu_A] = sum;
    }
  }
}
int main()
{
  unsigned int l = 8;
  unsigned int Ns = 2*l;
  unsigned int block_size = 6; // must be between 1 and l
  unsigned int sector = (block_size%2 == 0) ? block_size/2 : (block_size+1)/2;

  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  compute_mddr(Ns,block_size,sector);
  high_resolution_clock::time_point t2 = high_resolution_clock::now();

  duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
  std::cout << "Function took " << time_span.count() << " seconds." << std::endl;
}
The compute_mddr function is basically entirely filling up the matrix mddr, and this corresponds to the two outermost loops 1 and 2. I decided to parallelize loop 3, since it is essentially computing a sum. To give orders of magnitude, loop 3 runs over the ~50-100 elements of the basis_NBsec vector, while the two innermost loops si and sj run over the ~10000 elements of the basis vector.
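My mental model of reduction(+:sum) comes from toy examples like the one below; this is just my own minimal sketch of the same pattern (built with g++ -O3 -fopenmp as well), not a piece of my actual program:

#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
  std::vector<double> v(10000, 0.5);
  double sum = 0.0;
  // each thread accumulates into its own private copy of sum;
  // OpenMP combines the private copies when the loop finishes
  #pragma omp parallel for reduction(+:sum)
  for (std::size_t i = 0; i < v.size(); i++)
    sum += v[i]*v[i];
  std::cout << sum << std::endl; // 10000 * 0.25 = 2500
}

Loop 3 in compute_mddr follows exactly this structure, just with much more work per iteration.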
However, when running the code (compiled with -O3 -fopenmp, gcc 5.4.0, Ubuntu 16.04, i5-4440 CPU), I see either no speed-up (2 threads) or only a very limited gain (3 and 4 threads):
time OMP_NUM_THREADS=1 ./a.out
Function took 230.435 seconds.
real 3m50.439s
user 3m50.428s
sys 0m0.000s
time OMP_NUM_THREADS=2 ./a.out
Function took 227.754 seconds.
real 3m47.758s
user 7m2.140s
sys 0m0.048s
time OMP_NUM_THREADS=3 ./a.out
Function took 181.492 seconds.
real 3m1.495s
user 7m36.056s
sys 0m0.036s
time OMP_NUM_THREADS=4 ./a.out
Function took 150.564 seconds.
real 2m30.568s
user 7m56.156s
sys 0m0.096s
If I understand the user times correctly, the CPU utilization is poor for 3 and 4 threads (and indeed, while the code is running I see ~250% CPU usage with 3 threads and barely 300% with 4).
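To put numbers on it, taking the 1-thread run as the baseline, the speed-ups are roughly:

230.4 / 227.8 ≈ 1.01 with 2 threads (≈ 51% parallel efficiency)
230.4 / 181.5 ≈ 1.27 with 3 threads (≈ 42%)
230.4 / 150.6 ≈ 1.53 with 4 threads (≈ 38%)

so the scaling is nowhere near the ideal 2x/3x/4x.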
This is my first use of OpenMP; I have only played with it briefly on simple examples. Here, as far as I can see, I am not modifying any of the shared vectors basis_NAsec, basis_NBsec and basis in the parallel part, only reading them (this was an aspect pointed out in several related questions I read).
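For what it's worth, my reading of the data-sharing in loop 3, spelled out explicitly instead of relying on the implicit rules (the shared list below is just my own enumeration of what the loop body touches), would be:

#pragma omp parallel for reduction(+:sum) default(none) \
    shared(Ns, ns_A, ns_B, bs_B, basis, basis_NBsec, gs, mask_A, mask_B, mu_A_bin, nu_A_bin)

with mu_B and all the bitsets declared inside the loop being automatically private to each thread.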
So, what am I doing wrong?