
I am developing a numerical simulation that can be run in parallel to speed it up. I usually run the simulation multiple times and then average over the individual results. This loop over the multiple runs is parallelized using OpenMP:

    // set the number of threads
    omp_set_num_threads (prms.nthreads);

    #pragma omp parallel if(prms.parallel) shared(total) private(iRun)
    {
    #pragma omp for schedule(dynamic)
        for (iRun = 1; iRun <= prms.number_runs; iRun++)
        {
            // perform the simulation
        }
    }

There are literally no shared variables except total, an array with one element per run (indexed by iRun) that holds the result of the corresponding run. On every machine I have tested so far, the speed increased proportionally with the number of cores; with 4 threads it is 4 times as fast as without parallelization. However, on our computational cluster this is NOT the case (the second run below uses parallelization with 2 threads, so it should be about twice as fast):

    $ time hop ...

    real    0m50.595s
    user    0m50.484s
    sys     0m0.088s

    $ time hop ... -P

    real    1m35.505s
    user    3m9.238s
    sys     0m0.134s

As you can see, the parallel calculation is much slower than the serial one, even in total CPU time. I am sure that this is not a memory issue and that the machine has multiple cores.

What could be the issue? Is it maybe the OpenMP implementation? Or is something misconfigured in the system? I really don't know what to look for.

Actually, what I see is that your cluster's real, user and sys values are all higher than the corresponding values on your test machine. What timings do you get when you use the clock on the wall of your office? Or, perhaps more to the point, what timings do you get if you insert calls to a clock routine at the start and end of the program? – High Performance Mark
Both runs are on the same machine; I just set the -P flag for the second one, making the program run in parallel. The second run really was/felt longer. On my laptop, for example, the second run would show about 25 or 30 seconds, being twice as fast as the first one... – janoliver
Does your simulation perform a lot of I/O operations? Do your threads constantly jump from one core to another during the parallel run? Does your computation involve random number generation (which would explain why you need to run it multiple times)? If so, how do you generate all these numbers? (Depending on how you do things, there could be a hidden shared variable.) – François Févotte
Are you sure the random number generation library is the same on both machines? – François Févotte
Could you use different RNG instances for all threads (using gsl_rng_alloc) and redo your timings to see if something changed? – François Févotte
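
For illustration, a minimal sketch of the per-thread RNG idea suggested above, assuming GSL is used for the random numbers; the rngs array and base_seed are made-up names, not taken from the question:

    /* file-level includes (assumed) */
    #include <stdlib.h>
    #include <gsl/gsl_rng.h>
    #include <omp.h>

    /* Setup: one independent RNG per thread, each with its own seed,
       so no generator state is shared (and contended) between threads. */
    unsigned long base_seed = 42;                    /* hypothetical seed */
    int nthreads = omp_get_max_threads();
    gsl_rng **rngs = malloc(nthreads * sizeof(gsl_rng *));
    for (int t = 0; t < nthreads; t++) {
        rngs[t] = gsl_rng_alloc(gsl_rng_mt19937);
        gsl_rng_set(rngs[t], base_seed + t);
    }

    #pragma omp parallel for if(prms.parallel) schedule(dynamic)
    for (int iRun = 1; iRun <= prms.number_runs; iRun++) {
        gsl_rng *r = rngs[omp_get_thread_num()];
        /* perform the simulation, drawing random numbers only from r,
           e.g. gsl_rng_uniform(r), and store the result in total[iRun] */
    }

Each thread then draws from its own generator, so no hidden lock or shared RNG state can serialize the threads.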

2 Answers


It seems like false sharing (cache-coherence traffic) could be an issue. If total is your shared array and each thread updates its own cell in total, then, because the threads pick up work dynamically, it is very likely that different threads update adjacent elements of total that lie in the same cache line.

On your test machines this probably doesn't hurt much, since total stays coherent in a shared L3 cache, but on the cluster node, where the cache line may have to bounce back and forth between sockets (or NUMA nodes), the extra coherence traffic can hurt.
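
If that turns out to be the cause, one common mitigation is to give each result its own cache line, or to accumulate each run's result in a local variable and write it to the shared array only once. A rough sketch, assuming total holds one double per run and a 64-byte cache line; padded_result and total_padded are made-up names:

    #include <stdlib.h>

    /* Each result occupies its own cache line, so neighbouring runs can
       never invalidate each other's line (64-byte line size assumed). */
    typedef struct {
        double value;
        char   pad[64 - sizeof(double)];
    } padded_result;

    padded_result *total_padded =
        calloc(prms.number_runs + 1, sizeof(padded_result));

    #pragma omp parallel for if(prms.parallel) schedule(dynamic)
    for (int iRun = 1; iRun <= prms.number_runs; iRun++) {
        double result = 0.0;
        /* perform the simulation, accumulating into the local 'result' */
        total_padded[iRun].value = result;   /* single write at the end */
    }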


Since I don't have enough reputation, I'm adding this as an answer rather than a comment:

Have you made sure that you also initialize the data for your different simulations in parallel, and not serially?

I know that this can make a huge difference depending on your architecture. Maybe you can give a hint about the architectures.

To be precise: If you do

    for (i = 1; i <= prms.number_runs; ++i)
        allocAndInitializeSimulationData(i);

    #pragma omp parallel if(prms.parallel) shared(total) private(iRun)
    {
    #pragma omp for schedule(dynamic)
        for (iRun = 1; iRun <= prms.number_runs; iRun++)
        {
            // perform the simulation
        }
    }

That can be much slower than

    #pragma omp parallel if(prms.parallel) shared(total) private(iRun, i)
    {
    #pragma omp for schedule(dynamic)
        for (i = 1; i <= prms.number_runs; ++i)
            allocAndInitializeSimulationData(i);

    #pragma omp for schedule(dynamic)
        for (iRun = 1; iRun <= prms.number_runs; iRun++)
        {
            // perform the simulation
        }
    }
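
If the cluster nodes are NUMA machines (e.g. multi-socket), a plausible mechanism behind this is the first-touch page placement policy: data initialized serially by a single thread lands on one memory node, so threads running on another socket later pay for remote memory accesses on every load. Initializing in parallel places each run's data close to the thread that will use it.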