3
votes

I have a function that is apparently the bottleneck of my whole program, and I thought that parallelizing it with OpenMP could help.

Here is a working example of my computation (sorry, the function is a bit long). In my program, some of the work before the 5 nested loops is done somewhere else and is not a performance concern at all.

#include <vector>
#include <iostream>
#include <cmath>
#include <cstdlib>
#include <chrono>
#include "boost/dynamic_bitset.hpp"

using namespace std::chrono;

void compute_mddr(unsigned Ns, unsigned int block_size, unsigned int sector)
{
  std::vector<unsigned int> basis;
  for (std::size_t s = 0; s != std::pow(2,Ns); s++) {
    boost::dynamic_bitset<> s_bin(Ns,s);
    if (s_bin.count() == Ns/2) {
      basis.push_back(s);
    }
  }
  std::vector<double> gs(basis.size());
  for (unsigned int i = 0; i != gs.size(); i++)
    gs[i] = double(std::rand())/RAND_MAX;

  unsigned int ns_A = block_size;
  unsigned int ns_B = Ns-ns_A;
  boost::dynamic_bitset<> mask_A(Ns,(1<<ns_A)-(1<<0));
  boost::dynamic_bitset<> mask_B(Ns,((1<<ns_B)-(1<<0))<<ns_A);

  // Find the basis of the A block
  unsigned int NAsec = sector;
  std::vector<unsigned int> basis_NAsec;
  for (unsigned int s = 0; s < std::pow(2,ns_A); s++) {
    boost::dynamic_bitset<> s_bin(ns_A,s);
    if (s_bin.count() == NAsec)
      basis_NAsec.push_back(s);
  }
  unsigned int bs_A = basis_NAsec.size();

  // Find the basis of the B block
  unsigned int NBsec = (Ns/2)-sector;
  std::vector<unsigned int> basis_NBsec;
  for (unsigned int s = 0; s < std::pow(2,ns_B); s++) {
    boost::dynamic_bitset<> s_bin(ns_B,s);
    if (s_bin.count() == NBsec)
      basis_NBsec.push_back(s);
  }
  unsigned int bs_B = basis_NBsec.size();

  std::vector<std::vector<double> > mddr(bs_A);
  for (unsigned int i = 0; i != mddr.size(); i++) {
    mddr[i].resize(bs_A);
    for (unsigned int j = 0; j != mddr[i].size(); j++) {
      mddr[i][j] = 0.0;
    }
  }

  // Main calculation part
  for (unsigned int mu_A = 0; mu_A != bs_A; mu_A++) { // loop 1
    boost::dynamic_bitset<> mu_A_bin(ns_A,basis_NAsec[mu_A]);
    for (unsigned int nu_A = mu_A; nu_A != bs_A; nu_A++) { // loop 2
      boost::dynamic_bitset<> nu_A_bin(ns_A,basis_NAsec[nu_A]);

      double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
      for (unsigned int mu_B = 0; mu_B < bs_B; mu_B++) { // loop 3
        boost::dynamic_bitset<> mu_B_bin(ns_B,basis_NBsec[mu_B]);

        for (unsigned int si = 0; si != basis.size(); si++) { // loop 4
          boost::dynamic_bitset<> si_bin(Ns,basis[si]);
          boost::dynamic_bitset<> si_A_bin = si_bin & mask_A;
          si_A_bin.resize(ns_A);
          if (si_A_bin != mu_A_bin)
            continue;
          boost::dynamic_bitset<> si_B_bin = (si_bin & mask_B)>>ns_A;
          si_B_bin.resize(ns_B);
          if (si_B_bin != mu_B_bin)
            continue;

          for (unsigned int sj = 0; sj < basis.size(); sj++) { // loop 5
            boost::dynamic_bitset<> sj_bin(Ns,basis[sj]);
            boost::dynamic_bitset<> sj_A_bin = sj_bin & mask_A;
            sj_A_bin.resize(ns_A);
            if (sj_A_bin != nu_A_bin)
              continue;
            boost::dynamic_bitset<> sj_B_bin = (sj_bin & mask_B)>>ns_A;
            sj_B_bin.resize(ns_B);
            if (sj_B_bin != mu_B_bin)
              continue;
            sum += gs[si]*gs[sj];
          }
        }
      }
      mddr[nu_A][mu_A] = mddr[mu_A][nu_A] = sum;
    }
  }
}


int main()
{
  unsigned int l = 8;
  unsigned int Ns = 2*l;
  unsigned block_size = 6; // must be between 1 and l
  unsigned sector = (block_size%2 == 0) ? block_size/2 : (block_size+1)/2;

  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  compute_mddr(Ns,block_size,sector);
  high_resolution_clock::time_point t2 = high_resolution_clock::now();
  duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
  std::cout << "Function took " << time_span.count() << " seconds.";
  std::cout << std::endl;
}

The compute_mddr function basically just fills the matrix mddr, which corresponds to the outermost loops 1 and 2. I decided to parallelize loop 3, since it essentially computes a sum. To give orders of magnitude, loop 3 runs over ~50-100 elements of the basis_NBsec vector, while the two innermost loops si and sj each run over the ~10000 elements of the vector basis.

However, when running the code (compiled with -O3 -fopenmp, gcc 5.4.0, Ubuntu 16.04, i5-4440 CPU) I see either no speed-up (2 threads) or only a limited gain (3 and 4 threads):

time OMP_NUM_THREADS=1 ./a.out
Function took 230.435 seconds.
real    3m50.439s
user    3m50.428s
sys 0m0.000s


time OMP_NUM_THREADS=2 ./a.out 
Function took 227.754 seconds.
real    3m47.758s
user    7m2.140s
sys 0m0.048s


time OMP_NUM_THREADS=3 ./a.out 
Function took 181.492 seconds.
real    3m1.495s
user    7m36.056s
sys 0m0.036s


time OMP_NUM_THREADS=4 ./a.out 
Function took 150.564 seconds.
real    2m30.568s
user    7m56.156s
sys 0m0.096s

If I understand the user numbers correctly, CPU usage is poor for 3 and 4 threads (and indeed, while the code is running I see ~250% CPU usage with 3 threads and barely 300% with 4 threads).

This is my first real use of OpenMP; I have only played with it briefly on simple examples. Here, as far as I can see, I'm not modifying any of the shared vectors basis_NAsec, basis_NBsec and basis in the parallel part, only reading them (an aspect pointed out in several related questions I read).

So, what am I doing wrong?


1 Answer

3
votes

A quick look at the performance of your program with perf record shows that, regardless of the number of threads, most of the time is spent in malloc and free. That's generally a bad sign, and it also inhibits parallelization.

Samples: 1M of event 'cycles:pp', Event count (approx.): 743045339605                                                                                                                         
  Children      Self  Command  Shared Object        Symbol                                                                                                                                    
+   17.14%    17.12%  a.out    a.out                [.] _Z12compute_mddrjjj._omp_fn.0                                                                                                         
+   15.45%    15.43%  a.out    libc-2.23.so         [.] __memcmp_sse4_1                                                                                                                       
+   15.21%    15.19%  a.out    libc-2.23.so         [.] __memset_avx2                                                                                                                         
+   13.09%    13.07%  a.out    libc-2.23.so         [.] _int_free                                                                                                                             
+   11.66%    11.65%  a.out    libc-2.23.so         [.] _int_malloc                                                                                                                           
+   10.21%    10.20%  a.out    libc-2.23.so         [.] malloc                                                                                                                                

The cause of the malloc and free calls is the constant creation of boost::dynamic_bitset objects, which are basically std::vectors. Note: with perf it can be challenging to find the callers of a certain function. You can simply run the program in gdb, interrupt it during the execution phase, set a breakpoint with break malloc and continue a few times to figure out the callers.

The most direct approach to improving performance is to keep those objects alive as long as possible, so they are not reallocated over and over again. This goes against the usual good practice of declaring variables as locally as possible. The transformation to reuse the dynamic_bitset objects could look like the following:

#pragma omp parallel for reduction(+:sum)
  for (unsigned int mu_B = 0; mu_B < bs_B; mu_B++) { // loop 3
    boost::dynamic_bitset<> mu_B_bin(ns_B,basis_NBsec[mu_B]);

    boost::dynamic_bitset<> si_bin(Ns);
    boost::dynamic_bitset<> si_A_bin(Ns);
    boost::dynamic_bitset<> si_B_bin(Ns);

    boost::dynamic_bitset<> sj_bin(Ns);
    boost::dynamic_bitset<> sj_A_bin(Ns);

    boost::dynamic_bitset<> sj_B_bin(Ns);

    for (unsigned int si = 0; si != basis.size(); si++) { // loop 4
      si_bin = basis[si];
      si_A_bin = si_bin;
      assert(si_bin.size() == Ns);
      assert(si_A_bin.size() == Ns);
      assert(mask_A.size() == Ns);
      si_A_bin &= mask_A;
      si_A_bin.resize(ns_A);
      if (si_A_bin != mu_A_bin)
        continue;
      si_B_bin = si_bin;
      assert(si_bin.size() == Ns);
      assert(si_B_bin.size() == Ns);
      assert(mask_B.size() == Ns);
      // Optimization note: dynamic_bitset::operator&
      // does create a new object, operator&= does not
      // Same for >>
      si_B_bin &= mask_B;
      si_B_bin >>= ns_A;
      si_B_bin.resize(ns_B);
      if (si_B_bin != mu_B_bin)
        continue;

      for (unsigned int sj = 0; sj < basis.size(); sj++) { // loop 5
        sj_bin = basis[sj];
        sj_A_bin = sj_bin;
        assert(sj_bin.size() == Ns);
        assert(sj_A_bin.size() == Ns);
        assert(mask_A.size() == Ns);
        sj_A_bin &= mask_A;
        sj_A_bin.resize(ns_A);
        if (sj_A_bin != nu_A_bin)
          continue;

        sj_B_bin = sj_bin;

        assert(sj_bin.size() == Ns);
        assert(sj_B_bin.size() == Ns);
        assert(mask_B.size() == Ns);

        sj_B_bin &= mask_B;
        sj_B_bin >>= ns_A;
        sj_B_bin.resize(ns_B);
        if (sj_B_bin != mu_B_bin)
          continue;
        sum += gs[si]*gs[sj];
      }
    }
  }

This already reduces the single-threaded runtime on my system from ~289 s to ~39 s. The program also scales almost perfectly up to ~10 threads (4.1 s).

With more threads, there are load-balancing issues in the parallel loop, which can be mitigated a bit by adding schedule(dynamic), but I'm not sure how relevant that is for you.
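
If you want to try it, it is just an extra clause on the existing pragma; a minimal sketch, with the body of loop 3 left unchanged:

#pragma omp parallel for reduction(+:sum) schedule(dynamic)
  for (unsigned int mu_B = 0; mu_B < bs_B; mu_B++) { // loop 3
    // ... body of loop 3 exactly as above ...
  }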

More importantly, you should consider using std::bitset. Even with the extremely expensive boost::dynamic_bitset constructions removed, the code is still very expensive: most of the time is spent in superfluous dynamic_bitset/vector code and in memmove/memcmp operating on a single word.

+   32.18%    32.15%  ope_gcc_dyn  ope_gcc_dyn          [.] _ZNSt6vectorImSaImEEaSERKS1_                                                                                                      
+   29.13%    29.10%  ope_gcc_dyn  ope_gcc_dyn          [.] _Z12compute_mddrjjj._omp_fn.0                                                                                                     
+   21.65%     0.00%  ope_gcc_dyn  [unknown]            [.] 0000000000000000                                                                                                                  
+   16.24%    16.23%  ope_gcc_dyn  ope_gcc_dyn          [.] _ZN5boost14dynamic_bitsetImSaImEE6resizeEmb.constprop.102                                                                         
+   10.25%    10.23%  ope_gcc_dyn  libc-2.23.so         [.] __memcmp_sse4_1                                                                                                                   
+    9.61%     0.00%  ope_gcc_dyn  libc-2.23.so         [.] 0xffffd47cb9d83b78                                                                                                                
+    7.74%     7.73%  ope_gcc_dyn  libc-2.23.so         [.] __memmove_avx_unaligned  

That basically goes away if you use a std::bitset that spans only a few words. Maybe 64 bits will always be enough for you. If the size really is dynamic over a large range, you could turn the entire function into a template, instantiate it for a number of different bit sizes, and dynamically select the appropriate instantiation. I suspect you would gain another order of magnitude in performance. This may in turn reduce parallel efficiency, requiring another round of performance analysis.
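
A rough sketch of what that could look like (block_sum and block_sum_any are names made up here, and the 16/32/64 widths are only an illustration of the dispatch idea): the two innermost loops rewritten on top of a fixed-width std::bitset, with the width selected at runtime.

#include <bitset>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sketch: the two innermost loops on fixed-width std::bitset instead of
// boost::dynamic_bitset, so every mask/compare stays within a word or two.
template <std::size_t NBits>
double block_sum(const std::vector<unsigned int>& basis,
                 const std::vector<double>& gs,
                 unsigned long mask_A, unsigned long mask_B, unsigned ns_A,
                 unsigned long mu_A, unsigned long nu_A, unsigned long mu_B)
{
  const std::bitset<NBits> mA(mask_A), mB(mask_B);
  const std::bitset<NBits> muA(mu_A), nuA(nu_A), muB(mu_B);
  double sum = 0.0;
  for (std::size_t si = 0; si != basis.size(); ++si) {
    const std::bitset<NBits> si_bin(basis[si]);
    if ((si_bin & mA) != muA) continue;            // A part must match mu_A
    if (((si_bin & mB) >> ns_A) != muB) continue;  // B part must match mu_B
    for (std::size_t sj = 0; sj != basis.size(); ++sj) {
      const std::bitset<NBits> sj_bin(basis[sj]);
      if ((sj_bin & mA) != nuA) continue;            // A part must match nu_A
      if (((sj_bin & mB) >> ns_A) != muB) continue;  // B part must match mu_B
      sum += gs[si] * gs[sj];
    }
  }
  return sum;
}

// Select a fixed width at runtime; instantiate only the sizes you actually need.
double block_sum_any(unsigned Ns,
                     const std::vector<unsigned int>& basis,
                     const std::vector<double>& gs,
                     unsigned long mask_A, unsigned long mask_B, unsigned ns_A,
                     unsigned long mu_A, unsigned long nu_A, unsigned long mu_B)
{
  if (Ns <= 16) return block_sum<16>(basis, gs, mask_A, mask_B, ns_A, mu_A, nu_A, mu_B);
  if (Ns <= 32) return block_sum<32>(basis, gs, mask_A, mask_B, ns_A, mu_A, nu_A, mu_B);
  if (Ns <= 64) return block_sum<64>(basis, gs, mask_A, mask_B, ns_A, mu_A, nu_A, mu_B);
  throw std::runtime_error("Ns too large for this sketch; fall back to dynamic_bitset");
}

With Ns <= 64 every mask and comparison fits in a single machine word, which should make the memmove/memcmp overhead shown in the profile above disappear.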

It's very important to use tools to understand the performance of your code. There are very simple and very good tools for all sorts of cases; in your case a simple one such as perf is sufficient.