High-level analysis
It is a common fallacy that hybrid programming (e.g. MPI+OpenMP) is a good idea. This fallacy is widely espoused by so-called HPC experts, many of whom are paper pushers at supercomputing centers and do not write much code. An expert take-down of the MPI+Threads fallacy is Exascale Computing without Threads.
This is not to say that flat MPI is the best model. For example, MPI experts espouse a two-level, MPI-only approach in Bi-modal MPI and MPI+threads Computing on Scalable Multicore Systems and MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory (free version). In the so-called MPI+MPI model, the programmer exploits shared-memory coherence domains using MPI shared memory instead of OpenMP, but with a private-by-default data model, which reduces the incidence of race conditions. Furthermore, MPI+MPI uses only one runtime system, which makes resource management and process topology/affinity easier. In contrast, MPI+OpenMP requires one either to use the fundamentally non-scalable fork-join execution model with threads (i.e. make MPI calls only between OpenMP parallel regions) or to enable MPI_THREAD_MULTIPLE in order to make MPI calls within threaded regions, and MPI_THREAD_MULTIPLE entails noticeable overhead on today's platforms.
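For concreteness, here is a minimal MPI+MPI sketch of the node-level shared-memory allocation; it is not code from the cited papers, and the array size and variable names are placeholders.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Split MPI_COMM_WORLD into per-node communicators of ranks that can share memory. */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    /* Allocate this rank's slice of a node-wide shared array.
     * Data stays private by default; sharing is explicit through the window. */
    const MPI_Aint local_n = 1000;            /* placeholder size */
    float *mine;
    MPI_Win win;
    MPI_Win_allocate_shared(local_n * sizeof(float), sizeof(float),
                            MPI_INFO_NULL, nodecomm, &mine, &win);

    /* A neighbor's slice can be addressed directly through the window. */
    int nrank;
    MPI_Comm_rank(nodecomm, &nrank);
    if (nrank > 0) {
        MPI_Aint size;
        int disp;
        float *left;
        MPI_Win_shared_query(win, nrank - 1, &size, &disp, &left);
        /* left[] now points at the previous rank's shared slice */
    }

    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}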
This topic could cover many pages, which I do not have time to write at the moment, so please see the cited links.
Reducing OpenMP runtime overheads
One reason that MPI+OpenMP does not perform as well as pure MPI is that OpenMP runtime overheads have a tendency to appear in too many places. One type of unnecessary runtime overhead comes from nested parallelism, which occurs when one nests one omp parallel construct inside another. Most programmers do not realize that a parallel region is a relatively expensive construct and that one should try to minimize their number. Furthermore, omp parallel for is the fusion of two constructs - parallel and for - and one should really think about them independently. Ideally, you create one parallel region that contains many worksharing constructs, such as for, sections, etc.
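As a rough sketch of that pattern (the arrays a and b, the size n, and the scale factor s are placeholders, not names from the original code), a single parallel region can host several worksharing loops, so the thread team is created only once:

void scale_and_shift(int n, float * restrict a, float * restrict b, float s)
{
    #pragma omp parallel        /* threads are created once, here */
    {
        #pragma omp for         /* worksharing construct #1 */
        for (int i = 0; i < n; ++i)
            a[i] *= s;

        #pragma omp for nowait  /* worksharing construct #2; nowait drops its barrier */
        for (int i = 0; i < n; ++i)
            b[i] += a[i];
    }                           /* implicit barrier at the end of the region */
}

The nowait clause on the second loop removes its implicit barrier; correctness here relies on the barrier at the end of the first loop and on the implicit barrier at the end of the parallel region.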
Below is your code modified to use only one parallel region and parallelism across both for loops. Because collapse requires perfect nesting (nothing between the two for loops), I had to move some code inside. However, nothing stops the compiler from hoisting this loop invariant back out after the OpenMP lowering (this is a compiler concept, which you can ignore), so that code might still only execute end_i times, rather than end_i*end_j times.
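As a hedged aside on what "perfect nesting" means (the names scale_rows, coeff, in, out, and n are placeholders, not part of the code that follows):

/* With collapse(2) the two loops must be perfectly nested, so a value that
 * depends only on i has to be computed inside the j loop; the compiler may
 * hoist it back out after OpenMP lowering. */
void scale_rows(int n, const float * restrict coeff,
                const float * restrict in[restrict],
                float * restrict out[restrict])
{
    #pragma omp parallel for collapse(2)
    for (int i = 1; i < n - 1; ++i) {
        for (int j = 1; j < n - 1; ++j) {
            const float ci = coeff[i];   /* loop-invariant in j, moved inside */
            out[i][j] = ci * in[i][j];
        }
    }
}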
Update: I've adapted the code from the other answer to demonstrate collapse instead.
There are a variety of ways to parallelize these two loops with OpenMP. Below you can see four versions, all of which are compliant with OpenMP 4. Version 1 is likely to be the best, at least with current compilers. Version 2 uses collapse but not simd (it is compliant with OpenMP 3). Version 3 could be the best, but it is harder to implement well and does not lead to SIMD code generation with some compilers. Version 4 parallelizes only the outer loop.
You should experiment to see which one of these options is the fastest for your application.
#if VERSION==1
#define OUTER _Pragma("omp parallel for")
#define INNER _Pragma("omp simd")
#elif VERSION==2
#define OUTER _Pragma("omp parallel for collapse(2)")
#define INNER
#elif VERSION==3
#define OUTER _Pragma("omp parallel for simd collapse(2)")
#define INNER
#elif VERSION==4
#define OUTER _Pragma("omp parallel for simd")
#define INNER
#else
#error Define VERSION
#define OUTER
#define INNER
#endif
struct {
    float cx;
    float cy;
} parms;

void update (int pSqrt, int id, int subN,
             const float * restrict gridPtr[restrict],
             float * restrict gridNextPtr[restrict])
{
    int beg_i = 1, beg_j = 1;
    int end_i = subN - 1, end_j = subN - 1;
    /* Shrink the iteration space for processes on the boundary of the process grid. */
    if ( id / pSqrt == 0 ) {
        beg_i = 2;
    } else if ( id / pSqrt == (pSqrt - 1) ) {
        end_i = subN - 2;
    }
    if (id % pSqrt == 0) {
        beg_j = 2;
    } else if ((id + 1) % pSqrt == 0) {
        end_j = subN - 2;
    }
    OUTER
    for ( int i = beg_i; i < end_i; ++i ) {
        INNER
        for ( int j = beg_j; j < end_j; ++j ) {
            gridNextPtr[i][j] = gridPtr[i][j]
                              + parms.cx * (gridPtr[i+1][j] + gridPtr[i-1][j] - 2 * gridPtr[i][j])
                              + parms.cy * (gridPtr[i][j+1] + gridPtr[i][j-1] - 2 * gridPtr[i][j]);
        }
    }
}
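For reference, the variant is selected at compile time through the VERSION macro. The command below is only illustrative; the OpenMP flag is compiler-specific (e.g. -fopenmp for GCC/Clang, -qopenmp for Intel) and heat.c is a placeholder file name:

cc -O3 -fopenmp -DVERSION=1 heat.c -o heat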
The example code above compiles correctly with the following compilers:
- GCC 5.3.0
- Clang-OpenMP 3.5.0
- Cray C 8.4.2
- Intel 16.0.1
It will not compile with PGI 11.7, both because of [restrict] (replacing it with [] is sufficient) and because of the OpenMP simd construct. This compiler lacks full support for C99, contrary to this presentation, and it is not too surprising that it does not support OpenMP 4 features such as simd, given that it was released in 2011. Unfortunately, I do not have access to a newer version.
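If you do need to build with such a compiler, a hedged workaround is to drop the array-bound qualifier from the prototype (keeping the pointer-level restrict) and to pick VERSION 2, which avoids simd:

/* Same routine with the non-portable [restrict] array bound replaced by [] */
void update(int pSqrt, int id, int subN,
            const float * restrict gridPtr[], float * restrict gridNextPtr[]);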