4
votes

Please look at this code.

Single-threaded program: http://pastebin.com/KAx4RmSJ. Compiled with:

g++ -lrt -O2 main.cpp -o nnlv2

Multi-threaded with OpenMP: http://pastebin.com/fbe4gZSn. Compiled with:

g++ -lrt -fopenmp -O2 main_openmp.cpp -o nnlv2_openmp

I tested it on a dual-core system (so we have two threads running in parallel). But the multi-threaded version is slower than the single-threaded one, and its timing is unstable (try running it a few times). What's wrong? Where did I make a mistake?

Some tests:

Single-thread:

Layers  Neurons  Inputs  Time (ns)
10      200      200     1898983
10      500      500     11009094
10      1000     1000    48116913

Multi-thread:

Layers  Neurons  Inputs  Time (ns)
10      200      200     2518262
10      500      500     13861504
10      1000     1000    53446849

I don't understand what is wrong.

4
How much of a difference was there? Can you give us the timing results? – user482594
Are you using a timer that actually gives you nanosecond precision? And what are you measuring with it: how many CPU cycles have elapsed on the system since you started your application, or how many CPU cycles your application itself has used? – Nathanael
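
To illustrate the distinction Nathanael is asking about: with the POSIX timers that -lrt links in, CLOCK_MONOTONIC measures elapsed wall-clock time, while CLOCK_PROCESS_CPUTIME_ID measures CPU time consumed by the whole process (which can be roughly twice the wall time when two cores are busy). A rough sketch, not the question's actual timing code:

#include <time.h>
#include <cstdio>

int main()
{
    timespec wall_start, wall_end, cpu_start, cpu_end;

    clock_gettime(CLOCK_MONOTONIC, &wall_start);          // elapsed wall-clock time
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu_start);  // CPU time used by this process

    // ... run the network here ...

    clock_gettime(CLOCK_MONOTONIC, &wall_end);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu_end);

    long long wall_ns = (wall_end.tv_sec - wall_start.tv_sec) * 1000000000LL
                      + (wall_end.tv_nsec - wall_start.tv_nsec);
    long long cpu_ns  = (cpu_end.tv_sec - cpu_start.tv_sec) * 1000000000LL
                      + (cpu_end.tv_nsec - cpu_start.tv_nsec);

    printf("wall: %lld ns, cpu: %lld ns\n", wall_ns, cpu_ns);
    return 0;
}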

4 Answers

2
votes

Is your goal here to study OpenMP, or to make your program faster? If the latter, it would be more worthwhile to write multiply-add code, reduce the number of passes, and incorporate SIMD.

Step 1: Combine loops and use multiply-add:

// remove the variable 'temp' completely and accumulate directly into outputs[j]
// (this assumes weights[i] holds layer i's NEURONS*INPUTS weights contiguously,
// neuron by neuron, so k must keep running across the whole layer rather than
// being reset for every neuron)
for (int i = 0; i < LAYERS; i++)
{
  int k = 0;

  for (int j = 0; j < NEURONS; j++)
  {
    outputs[j] = 0;

    for (int l = 0; l < INPUTS; l++, k++)
    {
      outputs[j] += inputs[l] * weights[i][k];
    }

    outputs[j] = sigmoid(outputs[j]);
  }

  std::swap(inputs, outputs);
}
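
Step 2 would be to cut passes further and vectorise the inner dot product. A minimal SSE sketch, assuming float data, INPUTS a multiple of 4 and 16-byte-aligned arrays; dot_sse is a made-up helper name, not something from the original code:

#include <xmmintrin.h>  // SSE intrinsics

// Dot product of two float arrays of length n.
// Assumes n is a multiple of 4 and both pointers are 16-byte aligned.
static float dot_sse(const float* a, const float* b, int n)
{
    __m128 acc = _mm_setzero_ps();

    for (int i = 0; i < n; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));  // four multiply-adds per iteration
    }

    // Horizontal sum of the four partial sums.
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

With such a helper the inner loop above becomes outputs[j] = sigmoid(dot_sse(inputs, &weights[i][k], INPUTS)); with k advanced by INPUTS per neuron.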
2
votes

Compiling with -static and -p, running, and then parsing gmon.out with gprof, I got:

45.65% gomp_barrier_wait_end

That's a lot of time in OpenMP's barrier routine, i.e. time spent waiting for the other threads to finish. Since you run the parallel for loop many times (once per layer, LAYERS in total), you lose much of the advantage of running in parallel: every time a parallel for loop finishes, there is an implicit barrier which won't return until all other threads have reached it.
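
One way to reduce that cost is to keep a single parallel region alive for all the layers, so the thread team is forked and joined once instead of LAYERS times. A minimal sketch, assuming the question's inputs/outputs buffers and a weights[i] array holding layer i's NEURONS*INPUTS weights contiguously (not the question's exact code):

#pragma omp parallel
{
  for (int i = 0; i < LAYERS; i++)
  {
    // The neuron loop is shared among the threads; the implicit barrier at the
    // end of "omp for" is still required, because layer i+1 reads layer i's outputs.
    #pragma omp for
    for (int j = 0; j < NEURONS; j++)
    {
      float sum = 0;
      for (int l = 0; l < INPUTS; l++)
        sum += inputs[l] * weights[i][j * INPUTS + l];
      outputs[j] = sigmoid(sum);
    }

    // One thread swaps the buffers; the implicit barrier of "omp single"
    // keeps every thread on the same layer.
    #pragma omp single
    std::swap(inputs, outputs);
  }
}

The barriers themselves can't be removed, since each layer depends on the previous one's outputs, but the repeated per-layer fork/join overhead goes away.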

0
votes

Before anything else, run the test in the multi-threaded configuration and MAKE SURE that Process Explorer (procexp) or Task Manager shows 100% CPU usage for it. If it doesn't, you are not actually using multiple threads, nor multiple processor cores.

Also, taken from the Wikipedia article on OpenMP:

Environment variables

A method to alter the execution features of OpenMP applications. Used to control loop iteration scheduling, the default number of threads, etc. For example, OMP_NUM_THREADS is used to specify the number of threads for an application.
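
For example, you could run the program as OMP_NUM_THREADS=2 ./nnlv2_openmp and check from inside the program how many threads you actually get. A small standalone sketch (not part of the question's code):

#include <cstdio>
#include <omp.h>

int main()
{
    // How many threads a parallel region may use (affected by OMP_NUM_THREADS).
    printf("max threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        // Print once, from inside an actual parallel region.
        #pragma omp single
        printf("threads in this region: %d\n", omp_get_num_threads());
    }

    return 0;
}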

0
votes

I don't see where you have actually used OpenMP - try #pragma omp parallel for above the main loop (the syntax is covered in the OpenMP documentation, for example).
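
A minimal sketch of where that pragma would go, assuming the question's inputs/outputs buffers and a flattened weights[i] array of NEURONS*INPUTS weights per layer (not the question's exact code):

for (int i = 0; i < LAYERS; i++)
{
  // The layer loop stays serial (layer i+1 needs layer i's outputs);
  // only the independent per-neuron work is split among the threads.
  #pragma omp parallel for
  for (int j = 0; j < NEURONS; j++)
  {
    float sum = 0;
    for (int l = 0; l < INPUTS; l++)
      sum += inputs[l] * weights[i][j * INPUTS + l];
    outputs[j] = sigmoid(sum);
  }

  std::swap(inputs, outputs);
}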

The slowness possibly comes from including OpenMP and its initialisation, from the code bloat it adds, or from the compilation otherwise changing as a result of the compiler flags you introduced to enable it. Alternatively, the loops are so small and simple that the overhead of threading far exceeds the performance gain.