OpenMP paralelization inhibits vectorization

Question

I am new to OpenMP and I am trying to paralelize following code using OpenMP:

#pragma omp parallel for
for(int k=0;k<m;k++)
{
   for(int j=n-1;j>=0;j--)
   {
       outX[k+j*m] = inB2[j+n * k] / inA2[j*n + j];

       for(int i=0;i<j;i++)
       {
           inB2[k*n+i] -= inA2[i+n * j] * outX[k + m*j];
       }
   }
}

Paralelize the outer cycle is pretty straight-forward, but to optimize it, I wanted to paralelize the inner-most cycle (the one iterating over i) as well. But when I try to do that like this:

#pragma omp parallel for
for(int i=0;i<j;i++)
{
    inB2[k*n+i] -= inA2[i+n * j] * outX[k + m*j];
}

the compiler does not vectorize the inner cycle ("loop versioned for vectorization because of possible aliasing"), which makes it run slower. I compiled it using gcc -ffast-math -std=c++11 -fopenmp -O3 -msse2 -funroll-loops -g -fopt-info-vec prog.cpp

Thanks for any advice!

EDIT: I am using __restrict keyword for the arrays.

EDIT2: Interesting is, that when I keep only the pragma in the inner cycle and remove it from the outer, gcc will vectorize it. So the problem only happens, when I try to paralelize both cycles.

EDIT3: The compiler will vectorize the loop when I use #pragma omp parallel for simd. But it's still slower than without parallelizing the inner loop at all.

It's easier to vectorize manually, than parallelize. Why not to just do it? (and keep automatic parallelization) — Serge Rogatch

Sergey Sergey · Accepted Answer · 2016-11-16T12:49:56

My guess is that after you parallelized the inner loop, your compiler lost the track of inA2, inB2 and outX. By default, it assumes that any regions of memory pointed by any pointers may overlap with each other. In C language the C99 Standard introduced restrict keyword, which informs the compiler that a pointer points to a memory block which is not pointed by any other pointer. C++ haven't got such a keyword, but, fortunately, g++ has an appropriate extension. So try to add __restrict to declarations of the pointers touched by the loop. For example, replace

double* outX;

with

double* __restrict outX;

OpenMP paralelization inhibits vectorization

3 Answers