2
votes

I am trying to parallelize code for a particle-based simulation and am experiencing poor performance with an OpenMP-based approach. By that I mean:

  • According to the Linux tool top, the CPUs running the OpenMP threads show an average usage of only 50 %.
  • With an increasing number of threads, the speed-up converges to a factor of about 1.6. Convergence is quite fast, i.e. I already reach a speed-up of 1.5 using 2 threads.

The following pseudo code illustrates the basic template of all the parallel regions I have implemented. Note that during a single time step, five parallel regions of the form shown below are executed. Basically, the force acting on a particle i < N is a function of several field properties of its neighboring particles j < NN(i).

omp_set_num_threads(ncpu);

#pragma omp parallel shared( quite_a_large_amount_of_readonly_data, force )
{
   int i, j;

   #pragma omp for
   for( i=0; i<N; i++ ){             // Loop over all particles
      for( j=0; j<NN(i); j++ ){      // Nested loop over all neighbors of i
         // No communication between threads, no atomic regions,
         // no barriers whatsoever.
         force[i] += function(j);
      }
   }
}

I am trying to sort out the cause of the observed bottleneck. My naive initial guess at an explanation:

As stated, a large amount of memory is shared between threads for read-only access. It is quite possible that different threads try to read the same memory location at the same time. Is this causing a bottleneck? Should I rather let OpenMP allocate private copies?

3
Is this large amount of data precached, or do you read it from a file during the run? I/O will always be there to destroy one's expectations. As you said too, multiple accesses to the same location may cause thrashing, so it would be good to set some access policies. - Rubens
Thanks. I don't need to read it from a file. The data are generated at runtime and stored in physical RAM. - Rakulan S.
How evenly distributed is NN(i)? Is load imbalance likely an issue? You can try different schedules for the loop to find out. Otherwise you will need to use a profiler to see where your time is being spent; I'm quite fond of scalasca for troublesome OpenMP performance issues. - Jonathan Dursi
If the loop is over 10^7 particles and you are seeing this problem with only 2 threads, then there are only a tiny fraction of force entries being falsely shared, so I don't think that's an issue, although you might well be seeing memory affinity issues if you haven't "first touched" the pages with the appropriate threads. - Jonathan Dursi

3 Answers

2
votes

How large is N, and how intensive is NN(i)?

You say nothing is shared, but force[i] probably lies within the same cache line as force[i+1]. This is what's known as false sharing and can be pretty detrimental. OpenMP batches iterations together to compensate for this, so with a large enough N I don't think this would be your problem.

If NN(i) isn't very CPU-intensive, you might simply have a memory bottleneck -- in which case throwing more cores at it won't solve anything.

1
votes

Assuming that force is a plain array of 4- or 8-byte data, you definitely have false sharing, no doubt about it.

Assuming that function(j) is calculated independently, you may want to do something like this:

    for( i=0; i<N; i+=STEP ){          // Loop over all particles, in blocks
       for( j=0; j<NN(i); j+=STEP ){   // Nested loop over neighbors, in blocks
          // No communication between threads, no atomic regions,
          // no barriers whatsoever.
          calc_next(i, j);
       }
    }


void calc_next(int i, int j)
{
    int ii, jj;
    for(ii = 0; ii < STEP; ii++)
    {
        for(jj = 0; jj < STEP; jj++)
        {
            force[i+ii] += function(j+jj);   // += to accumulate, as in the question
        }
    }
}

That way, each thread calculates a bunch of consecutive entries, and the bunches handled by different threads are far enough apart that you don't get false sharing.

If you can't do it this way, try to split the work up in some other way that gives each thread larger contiguous sections to calculate.

0
votes

As others have stated, false sharing on force could be a reason. Try it in this simple way:

#pragma omp for 
for( i=0; i<N; i++ ){
   double sum = force[i];   // use the element type of force
   for ( j=0; j<NN(i); j++ ){
      sum += function(j);
   }
   force[i] = sum;
}

Technically, it's possible that force[i] = sum still causes false sharing. But it's highly unlikely, because with a static schedule another thread would be accessing force[i + N/omp_get_num_threads()*omp_get_thread_num()], which is pretty far from force[i].

If scalability is still poor, try a profiler such as Intel Parallel Amplifier (or VTune) to see how much memory bandwidth is needed per thread. If the bandwidth is saturated, put some more DRAM in your computer, populating more memory channels :) That will really boost memory bandwidth.