
Short: Does the #pragma omp for simd OpenMP directive generate code that uses SIMD registers?

Longer: As stated in the OpenMP documentation, "The worksharing-loop SIMD construct specifies that the iterations of one or more associated loops will be distributed across threads that already exist [..] using SIMD instructions". From this statement, I would expect the following code (simd.c) to use XMM, YMM, or ZMM registers when compiled with gcc simd.c -o simd -fopenmp, but it does not.

#include <stdio.h>
#define N 100

int main() {
    int x[N];
    int y[N];
    int z[N];
    int i;
    int sum = 0;   /* must be initialized: reduction(+:sum) adds into the original value */

    for(i=0; i < N; i++) {
        x[i] = i;
        y[i] = i;
    }

    #pragma omp parallel
    {
        #pragma omp for simd
        for(i=0; i < N; i++) {
            z[i] = x[i] + y[i];
        }
        #pragma omp for simd reduction(+:sum)
        for(i=0; i < N; i++) {
            sum += x[i];
        }
    }
    printf("%d %d\n",z[N/2], sum);

    return 0;
}

When checking the assembly generated by gcc simd.c -S -fopenmp, no SIMD registers are used.

I can use SIMD registers without OpenMP by using the option -O3 because, according to the GCC documentation, it includes the -ftree-vectorize flag:

  • XMM registers: gcc simd.c -o simd -O3
  • YMM registers: gcc simd.c -o simd -O3 -march=skylake-avx512
  • ZMM registers: gcc simd.c -o simd -O3 -march=skylake-avx512 -mprefer-vector-width=512

However, using the flags -march=skylake-avx512 -mprefer-vector-width=512 combined with -fopenmp does not generate SIMD instructions.

Therefore, I can easily vectorize my code with -O3 without #pragma omp for simd, but not the other way around.

At this point, my goal is not simply to generate SIMD instructions but to understand how OpenMP SIMD directives work in GCC and how to generate SIMD instructions with OpenMP alone (without -O3).

Adding the simd clause doesn't alter the cost algorithms of popular compilers. Loop length 100 is barely enough to benefit from simd alone (not even AVX-512) in a simple reduction loop such as yours, and may not be enough to benefit from omp parallel in either loop. omp for simd requires generation of something resembling nested loops. Unless the compiler can specialize for a specific loop count which is a multiple of SIMD length times number of threads, both inner and outer loops require run-time remainder and possibly alignment code. – tim18
The usual effect of the simd clause with GNU compilers is simply to overrule detection of possible aliasing which might invalidate SIMD vectorization. – tim18
@tim18: 100 elements is enough for 128-bit SIMD to be worth it on modern x86, especially when the size (a multiple of the vector width) and alignment (by 16) are known good. The time to horizontal-sum the vector at the end is pretty small. Unlike some microarchitectures (e.g. some ARM) where the vector unit is only loosely coupled and there are big delays to get a vector result into a scalar register, x86 only has a couple cycles of latency for movd eax, xmm0. – Peter Cordes

1 Answer


Enable at least -O2 for -fopenmp to work, and for performance in general

gcc simd.c -S -fopenmp

GCC's default is -O0, anti-optimized for consistent debugging. It's never going to auto-vectorize at -O0 because that's pointless when every i value from the C source has to exist in memory, and so on. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?

Also impossible when you have to be able to single-step source lines one at a time, and even modify i or memory contents at runtime with the debugger, and have the program keep running like you'd expect the C abstract machine would.

Building without any optimization is utter garbage for performance; it's insane to even consider if you care about performance enough to be using OpenMP. (Except of course for actual debugging.) Often the speedup from anti-optimized to optimized scalar is more than what you could gain from vectorizing that scalar code, but both can be large factors so you definitely want optimizations beyond auto-vectorization.


I can use SIMD registers without OpenMP using the option -O3 because according to GCC documentation it includes the -ftree-vectorize flag.

Right, so do that. -O3 -march=native -flto is usually your best bet for code that will run on the compile host. Also -fno-trapping-math -fno-math-errno should be safe for everything and enable some better FP function inlining, even if you don't want -ffast-math. Also preferably -fprofile-generate / -fprofile-use profile-guided optimization (PGO), to unroll hot loops and choose branchy vs. branchless appropriately, etc.
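
For example (a minimal sketch of my own, not from the original post; the function name my_sqrt is made up), -fno-math-errno is what lets GCC inline a sqrt() call down to a single sqrtsd instruction instead of keeping a fallback call to the libm function that has to set errno for negative inputs:

#include <math.h>

/* With plain -O2, GCC emits sqrtsd plus a compare and a conditional call to
   sqrt() so errno can be set for negative inputs; with -O2 -fno-math-errno
   the whole function compiles to just the sqrtsd instruction. */
double my_sqrt(double x) {
    return sqrt(x);
}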

#pragma omp parallel is still effective at -O3 -fopenmp - GCC doesn't enable autoparallelization by default.

Also, #pragma omp simd will use a different vectorization style sometimes. In your case, it seems to make GCC forget that it knows the arrays are 16-byte aligned, and use movdqu loads (when AVX isn't available for an unaligned memory source operand for paddd xmm0, [rax]). Compare https://godbolt.org/z/8q8Dqm - the main._omp_fn.0: helper function that main calls doesn't assume alignment. (Although maybe it can't after division by number of threads splits up the array into ranges, if GCC doesn't bother to do vector-sized chunks?)
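
(Side note, not part of the original answer: if you know the arrays really are 16-byte aligned, OpenMP's aligned clause is the way to hand that information back to the compiler, e.g. something like this on the question's first loop.)

/* Sketch only: promises 16-byte alignment of x, y and z, so the compiler
   may use aligned vector loads/stores instead of movdqu. This is undefined
   behaviour if the arrays are not actually aligned as claimed. */
#pragma omp for simd aligned(x, y, z : 16)
for(i=0; i < N; i++) {
    z[i] = x[i] + y[i];
}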


Use -O2 -fopenmp to get what you were expecting

OpenMP will let gcc vectorize more easily or more efficiently for loops where you didn't use restrict on pointer args to functions to let it know that arrays don't overlap. For floating point, it lets GCC pretend that FP math is associative even if you didn't use -ffast-math.

Or if you enable some optimization but not full optimization (e.g. -O2, which doesn't include -ftree-vectorize), then #pragma omp will work the way you expected.
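
As a concrete illustration (my own minimal sketch, not code from the question; the function names are made up), both effects show up in functions like these when compiled with gcc -O2 -fopenmp-simd -S (or with -fopenmp):

#include <stddef.h>

/* Without restrict, gcc can't prove dst doesn't overlap a or b; the pragma
   tells it to vectorize anyway, without runtime overlap checks. */
void add_arrays(float *dst, const float *a, const float *b, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* An FP reduction normally isn't vectorized without -ffast-math, because
   reordering the additions changes rounding; the reduction clause grants
   that permission for this one loop. */
float sum_array(const float *a, size_t n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

(-fopenmp-simd enables just the simd-related pragmas, without linking the OpenMP runtime or spawning threads.)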

Note that the x[i] = y[i] = i; init loop doesn't get auto-vectorized at -O2, but the #pragma loops are. And that without -fopenmp, everything stays pure scalar. (Godbolt compiler explorer)


The serial -O3 code will run faster for this small N because thread-startup overhead is nowhere near worth it. But for large N, parallelization could help if a single core can't saturate memory bandwidth (e.g. on a Xeon, but most dual/quad-core desktop CPUs can almost saturate mem bandwidth with one core). Or if your arrays are hot in cache on different cores.

Unfortunately(?) even GCC -O3 doesn't manage to do constant-propagation through your whole code and just print the result. Or to fuse the z[i] = x[i]+y[i] loop with the sum(x[]) loop.