3
votes

I have been looking more closely at the OpenMP simd construct, and I have three loops which gcc does not seem to auto-vectorize (based on brief performance tests), although I think it could. So I am wondering whether it is safe to add the simd pragma, and why gcc is not vectorizing them on its own.

First is a matrix multiplication with values stored as single array:

#pragma omp parallel for
    for(size_t row = 0; row < 100; ++row){
        #pragma omp simd
        for(size_t col = 0; col < 100; ++col){
            float sum = c[row * 100 + col];
            for(size_t k = 0; k < 100; k++){
                sum += a[row * 100 + k] * b[k * 100 + col];
            }
            c[row * 100 + col] = sum;
        }
    }

I am aware that b is not transposed, which hinders performance. By adding the simd pragma the code gets way faster. Is auto-vectorization not possible because of the inner loop?
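One way to test this hypothesis (a sketch, assuming the same 100×100 row-major layout and hypothetical function name) is to move the simd pragma onto the innermost k loop as a reduction, which gives the compiler a single-loop pattern it recognizes:

```cpp
#include <cstddef>

// Hypothetical variant: simd reduction on the innermost k loop.
// a, b, c are 100x100 row-major matrices; c holds the initial sums.
void matmul_simd(const float *a, const float *b, float *c) {
    #pragma omp parallel for
    for (std::size_t row = 0; row < 100; ++row) {
        for (std::size_t col = 0; col < 100; ++col) {
            float sum = c[row * 100 + col];
            // reduction(+:sum) permits reordering the FP additions,
            // which the compiler otherwise may not do.
            #pragma omp simd reduction(+:sum)
            for (std::size_t k = 0; k < 100; ++k) {
                sum += a[row * 100 + k] * b[k * 100 + col];
            }
            c[row * 100 + col] = sum;
        }
    }
}
```

Note that without the reduction clause (or -ffast-math) the compiler must preserve the serial order of the floating-point additions, which blocks vectorization of the k loop.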

For the second example I tried the custom reduction declaration feature of OpenMP, which is not actually needed here.

#pragma omp declare reduction(sum : double : omp_out += omp_in) initializer(omp_priv = 0)
double red_result = 0;
#pragma omp parallel for simd reduction(sum:red_result)
    for(size_t i = 0; i < 100; ++i){
        red_result = red_result + a[i];
    }

Does the reduction prevent vectorization? I would have thought that it should work fine.
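For comparison (a sketch with a hypothetical function name, assuming an array of doubles), the built-in + reduction already covers this case, so no declare reduction is needed at all:

```cpp
#include <cstddef>

// Sum using the built-in OpenMP + reduction; the simd clause tells
// the compiler it may reorder the floating-point additions.
double sum_array(const double *a, std::size_t n) {
    double red_result = 0.0;
    #pragma omp parallel for simd reduction(+:red_result)
    for (std::size_t i = 0; i < n; ++i) {
        red_result += a[i];
    }
    return red_result;
}
```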

The last example is a complex loop, with another inner loop and function calls. Simplified it looks something like this:

#pragma omp parallel for simd
for(size_t i = 0; i < 100; ++i){
  [..]
  for(size_t j = 0; j < 100; j++){
    if(j != i){
      float k2 = a[i] - b[j];
      k = std::sqrt(k2);
    }
  }
  [do more with k]
}

So here the problem is probably the sqrt call, which cannot be vectorized? But should the performance be better with the simd pragma? Some brief tests suggest that this is the case, but if auto-vectorization is not possible because of std::sqrt, why should it be possible with the pragma?
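A minimal check (a sketch with made-up arrays and a hypothetical function name) is to isolate the sqrt into its own simd loop; x86 has packed square-root instructions, so this loop can vectorize without any library call:

```cpp
#include <cmath>
#include <cstddef>

// Isolated loop: with omp simd the compiler may emit packed sqrt
// instructions (e.g. sqrtps on x86) instead of a scalar call.
void sqrt_all(const float *in, float *out, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::sqrt(in[i]);
    }
}
```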

Thank you for your help! :)

FP math is not associative. Compilers can't autovectorize FP reductions without -ffast-math or an OpenMP pragma that gives them permission to sum in a different order. - Peter Cordes
x86 has hardware support for SIMD sqrt. sqrtpd has as good throughput as sqrtsd on most CPUs, but does 2 double square roots in parallel. agner.org/optimize. - Peter Cordes
In the past, gcc ignored simd in the case of omp parallel simd; it would be reasonable to say that parallel disabled vectorization (at least where simd would be needed). The post above implies that this changed with gcc 7.1. Even with icc, my experience was that explicitly nested loops were needed to accomplish parallel simd. - tim18

1 Answer

3
votes

For math functions from math.h, your compiler needs to provide vectorized implementations of them. GCC does this with libmvec and ICC does this with SVML. As far as I know, Clang does not have native support for vectorized math functions.

Let's consider the following code:

void foo(float * __restrict a, float * __restrict b) {    
    a = (float*)__builtin_assume_aligned(a, 16);
    b = (float*)__builtin_assume_aligned(b, 16);          
    for(int i = 0; i < 100; ++i) {
        b[i] = sqrtf(a[i]);
    }
}

void foo2(float * __restrict a, float * __restrict b) {    
    a = (float*)__builtin_assume_aligned(a, 16);
    b = (float*)__builtin_assume_aligned(b, 16);          
    for(int i = 0; i < 100; ++i) {
        b[i] = sinf(a[i]);
    }
}

GCC, ICC, and Clang vectorize sqrtf (using one iteration of Newton's method). GCC and ICC vectorize sinf with libmvec (_ZGVbN4v_sinf) and SVML (__svml_sinf4) respectively. Clang does not vectorize sinf. See godbolt. sqrt is a special case (since the x86 instruction set has vectorized sqrt instructions) which can be inlined without a vectorized math library.