1
votes

I have read all the related questions I could find, but I still cannot solve my issue: a function with a double for loop is the bottleneck of my program.

The code is structured around MPI as follows:

  1. A big matrix is scattered among p processes.
  2. Every process now has a submatrix.
  3. Every process calls update() in a loop.
  4. When the loop terminates, the master process gathers the results (a simplified sketch of this driver follows).
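In case it helps, the driver looks roughly like the following simplified sketch (the sizes, the iteration count and the flat buffers are illustrative only; my real code uses float** submatrices and exchanges halo rows/columns between iterations):

/* Simplified, illustrative sketch of the MPI driver described above. */
#include <mpi.h>
#include <stdlib.h>

void update (int pSqrt, int id, int subN, float** gridPtr, float** gridNextPtr);

int main (int argc, char **argv)
{
        MPI_Init(&argc, &argv);

        int id, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int N = 1024;               /* global matrix dimension (illustrative) */
        int pSqrt = 2;              /* process grid is pSqrt x pSqrt          */
        int subN = N / pSqrt + 2;   /* submatrix dimension, including halo    */

        float *fullGrid = NULL;     /* the master would allocate and fill this */
        float *subGrid  = malloc(subN * subN * sizeof *subGrid);
        float *subNext  = malloc(subN * subN * sizeof *subNext);

        /* 1.-2. the master scatters the big matrix; every process gets a submatrix */
        MPI_Scatter(fullGrid, subN * subN, MPI_FLOAT,
                    subGrid,  subN * subN, MPI_FLOAT, 0, MPI_COMM_WORLD);

        /* 3. every process calls update() in a loop (halo exchange omitted here) */
        for (int step = 0; step < 1000; ++step) {
                /* update(pSqrt, id, subN, gridPtr, gridNextPtr); then swap grids */
        }

        /* 4. when the loop terminates, the master gathers the results */
        MPI_Gather(subGrid, subN * subN, MPI_FLOAT,
                   fullGrid, subN * subN, MPI_FLOAT, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
}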

Now I would like to augment my MPI code with OpenMP to get faster execution, by taking advantage of the double for loop in update():

void update (int pSqrt, int id, int subN, float** gridPtr, float ** gridNextPtr)
{
        int i = 1, j = 1, end_i = subN - 1, end_j = subN - 1;
        if ( id / pSqrt == 0) {
                i = 2;
                end_i = subN - 1;
        } else if ( id / pSqrt == (pSqrt - 1) ) {
                i = 1;
                end_i = subN - 2;
        }
        #pragma omp parallel for
        for ( ; i < end_i; ++i) {
                if (id % pSqrt == 0) {
                        j = 2;
                        end_j = subN - 1; 
                } else if ((id + 1) % pSqrt == 0) {
                        j = 1;
                        end_j = subN - 2;
                }
                #pragma omp parallel for
                for ( ; j < end_j; ++j) {
                        gridNextPtr[i][j] = gridPtr[i][j]  +
                          parms.cx * (gridPtr[i+1][j] +
                          gridPtr[i-1][j] -
                          2.0 * gridPtr[i][j]) +
                          parms.cy * (gridPtr[i][j+1] +
                          gridPtr[i][j-1] -
                          2.0 * gridPtr[i][j]);
                }
        }
}

I am running this on 2 computers, each with 2 CPUs, using 4 processes. However, I see no speedup between the runs with and without OpenMP. Any ideas? I am compiling with the -O1 optimization flag.

@HighPerformanceMark no, it is not. I had it just before the 1st for loop, but after my research on related questions I decided to move it there. However, I did not see any change in execution time. – gsamaras
OK then @HighPerformanceMark, I may have interpreted the answers I read wrongly, so I moved the #pragma above the for loop. However, it still gives me the same time with and without it. – gsamaras
I updated my question @HighPerformanceMark. Is it better now? – gsamaras
I am not sure what you are saying @HighPerformanceMark. Should I discard everything that has to do with MPI? But I do want to state that the MPI+OpenMP version is not faster (as I would expect) than the MPI version. – gsamaras
For a small number of cores it is quite normal that hybrid parallel programs are not faster than pure MPI. You should look for a difference in scaling at large numbers of nodes, where pure MPI already has problems. – Vladimir F

2 Answers

4
votes

High-level analysis

It is a common fallacy that hybrid programming (e.g. MPI+OpenMP) is a good idea. This fallacy is widely espoused by so-called HPC experts, many of whom are paper pushers at supercomputing centers and do not write much code. An expert take-down of the MPI+Threads fallacy is Exascale Computing without Threads.

This is not to say that flat MPI is the best model. For example, MPI experts espouse a two-level MPI-only approach in Bi-modal MPI and MPI+threads Computing on Scalable Multicore Systems and MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory (free version). In the so-called MPI+MPI model, the programmer exploits shared-memory coherence domains using MPI shared memory instead of OpenMP, but with a private-by-default data model, which reduces the incidence of race conditions. Furthermore, MPI+MPI uses only one runtime system, which makes resource management and process topology/affinity easier. In contrast, MPI+OpenMP requires one either to use the fundamentally non-scalable fork-join execution model with threads (i.e. make MPI calls only between OpenMP parallel regions) or to enable MPI_THREAD_MULTIPLE in order to make MPI calls within threaded regions; MPI_THREAD_MULTIPLE entails noticeable overhead on today's platforms.
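To give a flavour of the MPI+MPI model, here is a minimal, generic sketch using the MPI-3 shared-memory API (not code from the papers above; the window size is illustrative): the ranks on a node split off a shared-memory communicator, allocate one window, and can then load/store each other's portions directly.

/* Minimal, generic sketch of the MPI+MPI (MPI-3 shared memory) idea. */
#include <mpi.h>

int main (int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* One communicator per shared-memory node. */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    /* Each rank contributes a slab; together they form one window that
       every rank on the node can address with plain loads and stores. */
    MPI_Aint localsize = 1024 * sizeof(float);   /* illustrative size */
    float *mybase;
    MPI_Win win;
    MPI_Win_allocate_shared(localsize, sizeof(float), MPI_INFO_NULL,
                            nodecomm, &mybase, &win);

    /* Base address of rank 0's slab, to address the whole window. */
    MPI_Aint size0;
    int disp0;
    float *base0;
    MPI_Win_shared_query(win, 0, &size0, &disp0, &base0);

    /* ... compute on the shared window, synchronising with
       MPI_Win_fence or MPI_Barrier as appropriate ... */

    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}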

This topic could cover many pages, which I do not have time to write at the moment, so please see the cited links.

Reducing OpenMP runtime overheads

One reason that MPI+OpenMP does not perform as well as pure MPI is that OpenMP runtime overheads have a tendency to appear in too many places. One type of unnecessary runtime overhead comes from nested parallelism, which occurs when one omp parallel construct is nested inside another. Most programmers do not realise that a parallel region is a relatively expensive construct, and one should try to minimize the number of them. Furthermore, omp parallel for is the fusion of two constructs, parallel and for, and one should really try to think of these independently. Ideally, you create one parallel region that contains many worksharing constructs, such as for, sections, etc.
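To illustrate the last point with a generic sketch (not your code): a single parallel region can host several worksharing loops, so the thread team is forked only once and the loops merely synchronise at their implicit barriers.

/* Generic sketch: one parallel region, several worksharing constructs.
   The team is created once; each 'for' ends with an implicit barrier,
   but there is no repeated fork/join. */
void two_phases (int n, float *a, float *b)
{
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; ++i)
            a[i] = 2.0f * b[i];

        #pragma omp for
        for (int i = 0; i < n; ++i)
            b[i] = a[i] + 1.0f;
    }
}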

Below is your code, modified to use only one parallel region and to parallelize across both for loops. Because collapse requires perfect nesting (nothing between the two for loops), I had to move some code inside the loop nest. However, nothing stops the compiler from hoisting this loop-invariant code back out after the OpenMP lowering (a compiler detail you can ignore), so it might still execute only end_i times rather than end_i*end_j times.

Update: I've adapted the code from the other answer to demonstrate collapse instead.

There are a variety of ways to parallelize these two loops with OpenMP. Below you can see four versions, all of which are compliant with OpenMP 4. Version 1 is likely to be the best, at least in current compilers. Version 2 uses collapse but not simd (it is compliant with OpenMP 3). Version 3 could be the best, but is harder to implement ideally and does not lead to SIMD code generation with some compilers. Version 4 parallelizes only the outer loop.

You should experiment to see which one of these options is the fastest for your application.

#if VERSION==1
#define OUTER _Pragma("omp parallel for")
#define INNER _Pragma("omp simd")
#elif VERSION==2
#define OUTER _Pragma("omp parallel for collapse(2)")
#define INNER
#elif VERSION==3
#define OUTER _Pragma("omp parallel for simd collapse(2)")
#define INNER
#elif VERSION==4
#define OUTER _Pragma("omp parallel for simd")
#define INNER
#else
#error Define VERSION
#define OUTER
#define INNER
#endif


struct {
    float cx;
    float cy;
} parms;

void update (int pSqrt, int id, int subN, const float * restrict gridPtr[restrict], float * restrict gridNextPtr[restrict])
{
    int beg_i = 1, beg_j = 1;
    int end_i = subN - 1, end_j = subN - 1;
    if ( id / pSqrt == 0 ) {
        beg_i = 2;
    } else if ( id / pSqrt == (pSqrt - 1) ) {
        end_i = subN - 2;
    }
    if (id % pSqrt == 0) {
        beg_j = 2;
    } else if ((id + 1) % pSqrt == 0) {
        end_j = subN - 2;
    }
    OUTER
    for ( int i = beg_i; i < end_i; ++i ) {
        INNER
        for ( int j = beg_j; j < end_j; ++j ) {
            gridNextPtr[i][j] = gridPtr[i][j] + parms.cx * (gridPtr[i+1][j] + gridPtr[i-1][j] - 2 * gridPtr[i][j])
                                              + parms.cy * (gridPtr[i][j+1] + gridPtr[i][j-1] - 2 * gridPtr[i][j]);
        }
    }
}

The example code above is correct with the following compilers:

  • GCC 5.3.0
  • Clang-OpenMP 3.5.0
  • Cray C 8.4.2
  • Intel 16.0.1.

It will not compile with PGI 11.7, both because of [restrict] (replacing it with [] is sufficient) and because of the OpenMP simd clause. This compiler lacks full support for C99, contrary to this presentation. It is not too surprising that it does not support OpenMP 4, given that it was released in 2011. Unfortunately, I do not have access to a newer version.
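For reference, a prototype with [restrict] replaced by [], the workaround mentioned above, would read:

void update (int pSqrt, int id, int subN,
             const float * restrict gridPtr[],
             float * restrict gridNextPtr[]);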

3
votes

What about this version (not tested)?

Please compile it and test it. If it works better, I'll explain it more. BTW, using some more aggressive compiler options might help too.

void update (int pSqrt, int id, int subN, float** gridPtr, float ** gridNextPtr)
{
    int beg_i = 1, beg_j = 1;
    int end_i = subN - 1, end_j = subN - 1;
    if ( id / pSqrt == 0 ) {
        beg_i = 2;
    } else if ( id / pSqrt == (pSqrt - 1) ) {
        end_i = subN - 2;
    }
    if (id % pSqrt == 0) {
        beg_j = 2;
    } else if ((id + 1) % pSqrt == 0) {
        end_j = subN - 2;
    }
    #pragma omp parallel for schedule(static)
    for ( int i = beg_i; i < end_i; ++i ) {
        #pragma omp simd
        for ( int j = beg_j; j < end_j; ++j ) {
            gridNextPtr[i][j] = gridPtr[i][j] +
                parms.cx * (gridPtr[i+1][j] +
                        gridPtr[i-1][j] -
                        2.0 * gridPtr[i][j]) +
                parms.cy * (gridPtr[i][j+1] +
                        gridPtr[i][j-1] -
                        2.0 * gridPtr[i][j]);
        }
    }
}

EDIT: Some explanations of what I did to the code...

  1. The initial version used nested parallelism for no reason at all (a parallel region nested within another parallel region). This was likely to be very counterproductive, so I simply removed it.
  2. The loop indexes i and j were declared and initialised outside of the for statements. This is error-prone on two levels: 1/ it may force you to declare their parallel scope (private) explicitly, whereas having them within the for statement automatically gives them the right one; and 2/ you can get mixed up by erroneously reusing the indexes outside of the loops. Moving them into the for statements was easy (see the short sketch after this list).
  3. You were changing the boundaries of the j loop inside the parallel region for no good reason; you would have had to declare end_j private. Moreover, it was a potential limitation for further developments (such as using the collapse(2) directive), as it broke the rules for the canonical loop form defined in the OpenMP standard. Defining beg_i and beg_j outside of the parallel region therefore made sense, sparing some computations and keeping the loops simple and canonical.
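To illustrate item 2 with a generic sketch (not your code): an inner loop index declared outside the loops is shared by default under the outer parallel for and must be privatised by hand, whereas declaring it in the for statement makes it private automatically.

/* Generic sketch for item 2: scoping of an inner loop index. */
void fill (int n, float **a)
{
    int j;                                  /* shared by default...            */
    #pragma omp parallel for private(j)     /* ...so private(j) must be added  */
    for (int i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            a[i][j] = 0.0f;

    #pragma omp parallel for                /* nothing to remember here        */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)         /* j is private to each thread     */
            a[i][j] = 0.0f;
}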

From there, the code was suitable for vectorisation, and adding a simple simd directive on the j loop enforces it, should the compiler fail to see the opportunity by itself.