I am developing a numerical simulation that can be run in parallel to speed it up. I usually run the simulation multiple times and then average over the individual results. This loop over the multiple runs is parallelized using OpenMP:
// set the number of threads
omp_set_num_threads(prms.nthreads);

#pragma omp parallel if(prms.parallel) shared(total) private(iRun)
{
    #pragma omp for schedule(dynamic)
    for (iRun = 1; iRun <= prms.number_runs; iRun++)
    {
        // perform the simulation for run iRun
    }
}
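For reference, here is a minimal, self-contained sketch of the overall structure. The prms fields match the snippet above, but the params struct, simulate_run, the double result type, and the averaging at the end are assumptions made purely for illustration, not my actual simulation code:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

/* hypothetical parameter struct; only the fields used above */
struct params {
    int nthreads;     /* threads to use when parallel */
    int parallel;     /* nonzero => run the loop in parallel */
    int number_runs;  /* how many independent runs to average */
};

/* placeholder for one independent simulation run (assumption) */
static double simulate_run(int iRun)
{
    /* ... the real simulation would go here ... */
    return (double) iRun;
}

int main(void)
{
    struct params prms = { 4, 1, 100 };
    double *total = malloc(prms.number_runs * sizeof *total);
    int iRun;

    omp_set_num_threads(prms.nthreads);

    #pragma omp parallel if(prms.parallel) shared(total) private(iRun)
    {
        #pragma omp for schedule(dynamic)
        for (iRun = 1; iRun <= prms.number_runs; iRun++)
        {
            /* each run writes only its own slot, so there is no data race */
            total[iRun - 1] = simulate_run(iRun);
        }
    }

    /* average over the individual results */
    double avg = 0.0;
    for (iRun = 0; iRun < prms.number_runs; iRun++)
        avg += total[iRun];
    avg /= prms.number_runs;

    printf("average = %g\n", avg);
    free(total);
    return 0;
}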
There are literally no shared variables except total, which is an array with one entry per run; the entry at index iRun holds the result of the corresponding run. On all machines I have tested so far, the speed increased proportionally with the number of cores, so with 4 threads it is 4 times as fast as without parallelization. However, on our computational cluster this is NOT the case (the second run below uses parallelization with 2 threads, so it should be roughly twice as fast):
$ time hop ...
real 0m50.595s
user 0m50.484s
sys 0m0.088s
$ time hop ... -P
real 1m35.505s
user 3m9.238s
sys 0m0.134s
As you can see, the parallel calculation is much slower than the serial one, even in total runtime. I am sure that this is not a memory issue and that the machine has multiple cores.
What could be the issue? Is it maybe the OpenMP implementation? Or is something in the system misconfigured? I really don't know what to look for.
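One basic thing I can check is whether the runtime actually gives me the threads I ask for inside the parallel region. A minimal sketch of such a check, using only standard OpenMP calls (the thread count of 2 here mirrors the cluster test, not my real settings):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* sanity check: does the runtime honor the requested thread count? */
    omp_set_num_threads(2);

    #pragma omp parallel
    {
        #pragma omp single
        printf("threads in region: %d\n", omp_get_num_threads());

        printf("hello from thread %d\n", omp_get_thread_num());
    }
    return 0;
}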
Comment – François Févotte: "… gsl_rng_alloc) and redo your timings to see if something changed?"