I am trying to parallelize a code for particle-based simulations and experiencing poor performance of an OpenMP based approach. By that I mean:
- Displaying CPU usage using the Linux tool
top, OpenMP-threads running CPUs have an average usage of 50 %. - With increasing number of threads, speed up converges to a factor of about 1.6. Convergence is quite fast, i.e. I reach a speed up of 1.5 using 2 threads.
The following pseudo code illustrates the basic template for all parallel regions implemented.
Note that during a single time step, 5 parallel regions of the below shown fashion are being executed. Basically, the force acting on a particle i < N is a function of several field properties of neighboring particles j < NN(i).
omp_set_num_threads(ncpu);
#pragma omp parallel shared( quite_a_large_amount_of_readonly_data, force )
{
int i,j,N,NN;
#pragma omp for
for( i=0; i<N; i++ ){ // Looping over all particles
for ( j=0; j<NN(i); j++ ){ // Nested loop over all neighbors of i
// No communtions between threads, atomic regions,
// barriers whatsoever.
force[i] += function(j);
}
}
}
I am trying to sort out the cause for the observed bottleneck. My naive initial guess for an explanation:
As stated, there is large amount of memory being shared between threads for read-only access. It is quite possible that different threads try to read the same memory location at the same time. Is this causing a bottleneck ? Should I rather let OpenMP allocate private copies ?