
I want to write an effective parallel application for the Intel Xeon Phi coprocessor (61 cores) that performs a five-point stencil calculation. I wrote two versions of the code.

First: I used OpenMP "#pragma omp parallel for"

void ParallelStencil(const double* macierzIn, double* macierzOut, const int m, const int n)
{
    int m_real = m + 2;
    int n_real = n + 2;

    TimeCPU t;
    t.start();
    #pragma omp parallel for schedule(static,1) shared(macierzIn, macierzOut)
    for(int i=1; i<m_real-1; ++i)
    {
        for(int j=1; j<n-1; ++j)
        {
            macierzOut[i * n_real + j] = Max(macierzIn[i * n_real + j], macierzIn[(i - 1) * n_real + j], macierzIn[(i + 1) * n_real + j],
                                             macierzIn[i * n_real + (j - 1)], macierzIn[i * n_real + (j + 1)]);
        }
    }
    t.stop();
    cout << "\nTime: " << t.time();
}

Second: I divided the matrix among the 61 cores. Each part of the matrix is computed by the 4 hardware threads running on that core. In this version, I tried to reduce cache misses by keeping the calculations of a core's 4 threads within the same L2 cache.

void ParallelStencil(const double* macierzIn, double* macierzOut, int m, int n)
{
    int m_real = m + 2;
    int n_real = m + 2;
    int coreCount = threadsCount / 4;
    int tID, coreNum, start, stop, step;

    TimeCPU t;
    t.start();
    #pragma omp parallel shared(macierzIn, macierzOut, m, n, m_real, n_real, coreCount) private(tID, coreNum, start, stop, step)
    {
        tID = omp_get_thread_num();
        coreNum = tID / 4;                                  // 4 HW threads per core
        start = tID % 4 + ((m / coreCount) * coreNum) + 1;  // first row of this thread inside its core's block
        stop = (m / coreCount) * (coreNum + 1) + 1;         // one past the last row of the core's block
        if(coreNum == coreCount - 1 && stop != m_real - 1)
        {
            stop = m_real - 1;                              // the last core also takes the leftover rows
        }
        step = 4;                                           // the 4 threads of a core interleave rows round-robin

        for(int i=start; i<stop; i+=step)
        {
            for(int j=1; j<n+1; ++j)
            {
                macierzOut[i * n_real + j] = Max(macierzIn[i * n_real + j], macierzIn[(i - 1) * n_real + j], macierzIn[(i + 1) * n_real + j],
                                                 macierzIn[i * n_real + (j - 1)], macierzIn[i * n_real + (j + 1)]);

            }
        }
    }
    t.stop();
    cout << "\nTime: " << t.time();
}

In this version, the loop iterations within each part of the matrix are distributed as follows (an illustrative sketch of this mapping follows the listing):
i=0 -> thread 0
i=1 -> thread 1
i=2 -> thread 2
i=3 -> thread 3
i=4 -> thread 0
...
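
The following is a small, self-contained sketch (not part of the benchmarked code) that reproduces this row-to-thread mapping; the values m = 16 and coreCount = 2 are arbitrary illustration inputs, and the start/stop/step arithmetic is copied from the second version above.

#include <cstdio>

int main()
{
    const int m = 16;          // interior row count, chosen only for illustration
    const int coreCount = 2;   // number of cores, chosen only for illustration
    const int m_real = m + 2;
    const int threadsCount = coreCount * 4;

    for(int tID = 0; tID < threadsCount; ++tID)
    {
        int coreNum = tID / 4;
        int start = tID % 4 + ((m / coreCount) * coreNum) + 1;
        int stop = (m / coreCount) * (coreNum + 1) + 1;
        if(coreNum == coreCount - 1 && stop != m_real - 1)
        {
            stop = m_real - 1;
        }

        std::printf("thread %d (core %d) processes rows:", tID, coreNum);
        for(int i = start; i < stop; i += 4)
        {
            std::printf(" %d", i);
        }
        std::printf("\n");
    }
    return 0;
}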

After running both codes, the second version turned out to be slower. But why?

Comments:

Vladimir F Героям слава: How much slower? How did you compile it?

JudgeDeath: The first version is 124 times faster than the sequential algorithm (without parallelization), the second only 106 times. I compile the code with Intel Composer: icc -openmp -O3 -mmic -o ...

pburka: The threads may not be distributed across cores the way you expect. The default affinity setting is 'scatter', which attempts to spread threads out as much as possible. I think you're relying on the 'compact' affinity setting. Rather than maximizing cache sharing, you may be doing the opposite. Try experimenting with different affinity settings to see if that makes a difference. See software.intel.com/en-us/articles/… for more details.

JudgeDeath: I set the affinity via cpu_set_t. The same effect as KMP_AFFINITY="compact".

Jim Cownie: If you're using Linux cpusets, are you sure you understand how the logical CPUs map onto the KNC hardware? It's not at all obvious, different from the way the mapping happens on Xeons, and unexpected. Logical CPUs 0, 241, 242, 243 are the threads that share the last core, then 1, 2, 3, 4 are on core zero, 5, 6, 7, 8 on core one, and so on. You also definitely don't want to affinitize the OpenMP serial thread to logical CPU zero, since that's the "bootstrap processor", which the kernel prefers to use...
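
Following up on the affinity discussion, here is a minimal sketch (not from the original thread) that lets each OpenMP thread report the logical CPU it actually runs on. sched_getcpu() is Linux-specific, and the logical-CPU-to-core formula below simply encodes the KNC mapping described in the last comment, so adjust it if your system maps differently. Compile it with the same -openmp flag.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE                // sched_getcpu() is a GNU extension
#endif
#include <cstdio>
#include <omp.h>
#include <sched.h>

int main()
{
    #pragma omp parallel
    {
        int cpu = sched_getcpu();  // logical CPU this thread currently runs on
        // Assumed KNC mapping (see comment above): CPUs 1..240 -> cores 0..59,
        // CPUs 0 and 241..243 share the last core (60).
        int core = (cpu == 0 || cpu > 240) ? 60 : (cpu - 1) / 4;
        #pragma omp critical
        {
            std::printf("OpenMP thread %d runs on logical CPU %d (core %d)\n",
                        omp_get_thread_num(), cpu, core);
        }
    }
    return 0;
}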

1 Answer


This is probably more of a comment than an answer. Before diving into the issues with efficient cache utilisation, you should fix the two codes so that they are equivalent. Currently they are not.

Difference #1

First code:

int m_real = m + 2;
int n_real = n + 2;

Second code:

int m_real = m + 2;
int n_real = m + 2; // <---- m instead of n

Difference #2

First code:

for(int j=1; j<n-1; ++j)

Second code:

for(int j=1; j<n+1; ++j) // <---- n+1 instead of n-1

If your matrix happens not to be square and m > n, then the second code will definitely be slower, since it has to compute more elements.
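
For reference, this is a sketch of the loop bounds once both differences are removed, assuming the intent is to update the full m x n interior inside a one-element halo (whether j < n - 1 or j < n + 1 was the intended upper bound is, of course, for you to confirm):

int m_real = m + 2;
int n_real = n + 2;                       // n, not m, in both versions

for(int i = 1; i < m_real - 1; ++i)       // rows 1..m
{
    for(int j = 1; j < n_real - 1; ++j)   // columns 1..n, i.e. j < n + 1
    {
        macierzOut[i * n_real + j] = Max(macierzIn[i * n_real + j], macierzIn[(i - 1) * n_real + j], macierzIn[(i + 1) * n_real + j],
                                         macierzIn[i * n_real + (j - 1)], macierzIn[i * n_real + (j + 1)]);
    }
}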