2
votes

I am trying to parallelize code for a particle-based simulation and am experiencing poor performance with an OpenMP-based approach. By that I mean:

  • According to the Linux tool top, the CPUs running the OpenMP threads show an average usage of only 50 %.
  • With an increasing number of threads, the speed-up converges to a factor of about 1.6. Convergence is quite fast, i.e. I already reach a speed-up of 1.5 using 2 threads.

The following pseudo code illustrates the basic template of all the parallel regions I have implemented. Note that during a single time step, five parallel regions of the form shown below are executed. Basically, the force acting on a particle i < N is a function of several field properties of its neighboring particles j < NN(i).

omp_set_num_threads(ncpu);

#pragma omp parallel shared( quite_a_large_amount_of_readonly_data, force )
{
   int i, j;

   #pragma omp for
   for( i=0; i<N; i++ ){             // Loop over all particles
      for( j=0; j<NN(i); j++ ){      // Nested loop over all neighbors of i
         // No communication between threads, no atomic regions,
         // no barriers whatsoever.
         force[i] += function(j);
      }
   }
}

I am trying to sort out the cause of the observed bottleneck. My naive initial guess at an explanation:

As stated, a large amount of memory is shared between threads for read-only access. It is quite possible that different threads try to read the same memory location at the same time. Is this causing a bottleneck? Should I rather let OpenMP allocate private copies?

3
Is this large amount of data precached, or do you read it from a file during the run? I/O will always be there to destroy one's expectations. As you said too, multiple accesses to the same location may cause thrashing, so it would be good to set some access policies. - Rubens
Thanks. I don't need to read it from a file. The data are generated at runtime and stored in physical RAM. - Rakulan S.
How evenly distributed is NN(i)? Is load imbalance likely an issue? You can try different schedules for the loop to find out. Otherwise you will need to use a profiler to see where your time is being spent; I'm quite fond of scalasca for troublesome OpenMP performance issues. - Jonathan Dursi
If the loop is over 10^7 particles and you are seeing this problem with only 2 threads, then there are only a tiny fraction of force entries being falsely shared, so I don't think that's an issue, although you might well be seeing memory affinity issues if you haven't "first touched" the pages with the appropriate threads. - Jonathan Dursi

3 Answers

2
votes

How large is N, and how intensive is NN(i)?

You say nothing is shared, but force[i] probably lies within the same cache line as force[i+1]. This is what's known as false sharing and can be pretty detrimental. OpenMP batches iterations together to compensate for this, so with a large enough N I don't think this would be your problem.

If NN(i) isn't very CPU-intensive, you might simply have a memory bottleneck -- in which case throwing more cores at it won't solve anything.

1
votes

Assuming that force is a plain array of 4- or 8-byte data, you definitely have false sharing, no doubt about it.

Assuming that function(j) is calculated independently, you may want to do something like this:

    for( i=0; i<N; i+=STEP ){          // Loop over all particles, in blocks
       for( j=0; j<NN(i); j+=STEP ){   // Nested loop over neighbors, in blocks
          // No communication between threads, no atomic regions,
          // no barriers whatsoever.
          calc_next(i, j);
       }
    }


void calc_next(int i, int j)
{
    int ii, jj;
    for(ii = 0; ii < STEP; ii++)
    {
        for(jj = 0; jj < STEP; jj++)
        {
            force[i+ii] += function(j+jj);   // += to accumulate, as in the question
        }
    }
}

That way, each thread calculates a bunch of consecutive entries, and the bunches handled by different threads are far enough apart that you don't get false sharing.

If you can't do it this way, try to split the work up in some other way that gives each thread larger contiguous sections to calculate.

0
votes

As others have stated, false sharing on force could be a reason. Try it in this simple way:

#pragma omp for 
for( i=0; i<N; i++ ){
   double sum = force[i];   // use the element type of force
   for ( j=0; j<NN(i); j++ ){
      sum += function(j);
   }
   force[i] = sum;
}

Technically, it's possible that force[i] = sum still causes false sharing. But it's highly unlikely, because with a static schedule another thread would be accessing force[i + N/omp_get_num_threads()*omp_get_thread_num()], which is pretty far from force[i].

If scalability is still poor, try a profiler such as Intel Parallel Amplifier (or VTune) to see how much memory bandwidth is needed per thread. If the bandwidth is saturated, put some more DRAM in your computer, populating more memory channels :) That will really boost memory bandwidth.