4
votes

Please look at this code.

Single-threaded program: http://pastebin.com/KAx4RmSJ. Compiled with:

g++ -lrt -O2 main.cpp -o nnlv2

Multi-threaded with OpenMP: http://pastebin.com/fbe4gZSn. Compiled with:

g++ -lrt -fopenmp -O2 main_openmp.cpp -o nnlv2_openmp

I tested it on a dual-core system (so we have two threads running in parallel). But the multi-threaded version is slower than the single-threaded one, and its timing is unstable (try running it a few times). What's wrong? Where did I make a mistake?

Some tests:

Single-thread:

Layers  Neurons  Inputs  Time (ns)
10      200      200     1898983
10      500      500     11009094
10      1000     1000    48116913

Multi-thread:

Layers  Neurons  Inputs  Time (ns)
10      200      200     2518262
10      500      500     13861504
10      1000     1000    53446849

I don't understand what is wrong.

4
How much of a difference was there? Can you give us the timing results? – user482594
Are you using a timer that actually gives you nanosecond precision? And what are you measuring with it: how many CPU cycles have elapsed on the system since you started your application, or how many CPU cycles your application itself has used? – Nathanael
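
To illustrate the distinction Nathanael is asking about: with the POSIX timers that -lrt links in, CLOCK_MONOTONIC measures elapsed wall-clock time, while CLOCK_PROCESS_CPUTIME_ID measures CPU time consumed by the whole process (which can be roughly twice the wall time when two cores are busy). A rough sketch, not the question's actual timing code:

#include <time.h>
#include <cstdio>

int main()
{
    timespec wall_start, wall_end, cpu_start, cpu_end;

    clock_gettime(CLOCK_MONOTONIC, &wall_start);          // elapsed wall-clock time
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu_start);  // CPU time used by this process

    // ... run the network here ...

    clock_gettime(CLOCK_MONOTONIC, &wall_end);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu_end);

    long long wall_ns = (wall_end.tv_sec - wall_start.tv_sec) * 1000000000LL
                      + (wall_end.tv_nsec - wall_start.tv_nsec);
    long long cpu_ns  = (cpu_end.tv_sec - cpu_start.tv_sec) * 1000000000LL
                      + (cpu_end.tv_nsec - cpu_start.tv_nsec);

    printf("wall: %lld ns, cpu: %lld ns\n", wall_ns, cpu_ns);
    return 0;
}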

4 Answers

2
votes

Is your goal here to study OpenMP, or to make your program faster? If the latter, it would be more worthwhile to write multiply-add code, reduce the number of passes, and incorporate SIMD.

Step 1: Combine loops and use multiply-add:

// remove the variable 'temp' completely and accumulate directly into outputs[j]
// (this assumes weights[i] holds layer i's NEURONS*INPUTS weights contiguously,
// neuron by neuron, so k must keep running across the whole layer rather than
// being reset for every neuron)
for (int i = 0; i < LAYERS; i++)
{
  int k = 0;

  for (int j = 0; j < NEURONS; j++)
  {
    outputs[j] = 0;

    for (int l = 0; l < INPUTS; l++, k++)
    {
      outputs[j] += inputs[l] * weights[i][k];
    }

    outputs[j] = sigmoid(outputs[j]);
  }

  std::swap(inputs, outputs);
}
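
Step 2 would be to cut passes further and vectorise the inner dot product. A minimal SSE sketch, assuming float data, INPUTS a multiple of 4 and 16-byte-aligned arrays; dot_sse is a made-up helper name, not something from the original code:

#include <xmmintrin.h>  // SSE intrinsics

// Dot product of two float arrays of length n.
// Assumes n is a multiple of 4 and both pointers are 16-byte aligned.
static float dot_sse(const float* a, const float* b, int n)
{
    __m128 acc = _mm_setzero_ps();

    for (int i = 0; i < n; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));  // four multiply-adds per iteration
    }

    // Horizontal sum of the four partial sums.
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

With such a helper the inner loop above becomes outputs[j] = sigmoid(dot_sse(inputs, &weights[i][k], INPUTS)); with k advanced by INPUTS per neuron.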
2
votes

Compiling with -static and -p, running, and then parsing gmon.out with gprof, I got:

45.65% gomp_barrier_wait_end

That's a lot of time in OpenMP's barrier routine, i.e. time spent waiting for the other threads to finish. Since you run the parallel for loop many times (once per layer, LAYERS in total), you lose much of the advantage of running in parallel: every time a parallel for loop finishes, there is an implicit barrier which won't return until all other threads have reached it.
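
One way to reduce that cost is to keep a single parallel region alive for all the layers, so the thread team is forked and joined once instead of LAYERS times. A minimal sketch, assuming the question's inputs/outputs buffers and a weights[i] array holding layer i's NEURONS*INPUTS weights contiguously (not the question's exact code):

#pragma omp parallel
{
  for (int i = 0; i < LAYERS; i++)
  {
    // The neuron loop is shared among the threads; the implicit barrier at the
    // end of "omp for" is still required, because layer i+1 reads layer i's outputs.
    #pragma omp for
    for (int j = 0; j < NEURONS; j++)
    {
      float sum = 0;
      for (int l = 0; l < INPUTS; l++)
        sum += inputs[l] * weights[i][j * INPUTS + l];
      outputs[j] = sigmoid(sum);
    }

    // One thread swaps the buffers; the implicit barrier of "omp single"
    // keeps every thread on the same layer.
    #pragma omp single
    std::swap(inputs, outputs);
  }
}

The barriers themselves can't be removed, since each layer depends on the previous one's outputs, but the repeated per-layer fork/join overhead goes away.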

0
votes

Before anything else, run the test in the multi-threaded configuration and MAKE SURE that Process Explorer (procexp) or Task Manager shows 100% CPU usage for it. If it doesn't, you are not actually using multiple threads, nor multiple processor cores.

Also, taken from the Wikipedia article on OpenMP:

Environment variables

A method to alter the execution features of OpenMP applications. Used to control loop iteration scheduling, the default number of threads, etc. For example, OMP_NUM_THREADS is used to specify the number of threads for an application.
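
For example, you could run the program as OMP_NUM_THREADS=2 ./nnlv2_openmp and check from inside the program how many threads you actually get. A small standalone sketch (not part of the question's code):

#include <cstdio>
#include <omp.h>

int main()
{
    // How many threads a parallel region may use (affected by OMP_NUM_THREADS).
    printf("max threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        // Print once, from inside an actual parallel region.
        #pragma omp single
        printf("threads in this region: %d\n", omp_get_num_threads());
    }

    return 0;
}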

0
votes

I don't see where you have actually used OpenMP - try #pragma omp parallel for above the main loop (the syntax is covered in the OpenMP documentation, for example).
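
A minimal sketch of where that pragma would go, assuming the question's inputs/outputs buffers and a flattened weights[i] array of NEURONS*INPUTS weights per layer (not the question's exact code):

for (int i = 0; i < LAYERS; i++)
{
  // The layer loop stays serial (layer i+1 needs layer i's outputs);
  // only the independent per-neuron work is split among the threads.
  #pragma omp parallel for
  for (int j = 0; j < NEURONS; j++)
  {
    float sum = 0;
    for (int l = 0; l < INPUTS; l++)
      sum += inputs[l] * weights[i][j * INPUTS + l];
    outputs[j] = sigmoid(sum);
  }

  std::swap(inputs, outputs);
}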

The slowness possibly comes from including OpenMP and its initialisation, from the code bloat it adds, or from the compilation otherwise changing as a result of the compiler flags you introduced to enable it. Alternatively, the loops are so small and simple that the overhead of threading far exceeds the performance gain.