The Effect of Architecture When Using SSE / AVX Intrinsics

Question

I wonder how a compiler treats intrinsics.

If one uses SSE2 intrinsics (via #include <emmintrin.h>) and compiles with the -mavx flag, what will the compiler generate? Will it generate AVX or SSE code?

If one uses AVX2 intrinsics (via #include <immintrin.h>) and compiles with the -msse2 flag, what will the compiler generate? Will it generate SSE-only or AVX code?

How do compilers treat intrinsics?
If one uses Intrinsics, does it help the compiler understand the dependency in the loop for better vectorization?

For instance, what's going on here - https://godbolt.org/z/Y4J5OA (Or https://godbolt.org/z/LZOJ2K)?
See all 3 panes.

The Context

I'm trying to build several versions of the same function for different CPU features (SSE4 and AVX2).
I'm writing the function twice, once with SSE intrinsics and once with AVX intrinsics.
Let's say they are named MyFunSSE() and MyFunAVX(). Both are in the same file.

How can I make the compiler (the same method should work for MSVC, GCC and ICC) build each of them using only the respective instruction set?

Updated my answer. I think you're just looking for GNU C's __attribute__((target("avx"))). — Peter Cordes
godbolt.org/z/lRr9q7, godbolt.org/z/3pKKT2, godbolt.org/z/vViboK — Royi
What am I looking for in those links? compilers use VEX encodings when you compile with -mavx2, and they don't when you don't. This is how it's always worked for gcc/clang/ICC. (And MSVC for -arch:AVX or not.) — Peter Cordes
BTW, it's pointless to #include <emmintrin.h> if you're also going to include the catch-all #include <immintrin.h>. Always just #include <immintrin.h>, unless you want to include less on MSVC to stop yourself from accidentally using certain extensions, because its target-options model is different from gcc/clang. — Peter Cordes

Peter Cordes · Accepted Answer · 2019-04-18T14:41:45

GCC and clang require that you enable every extension you use. Otherwise it's a compile-time error, like: error: inlining failed in call to always_inline ‘__m256d _mm256_mask_loadu_pd(__m256d, __mmask8, const void*)’: target specific option mismatch

Using -march=haswell (or whichever CPU you target) is preferred over enabling specific extensions, because it also sets appropriate tuning options. It also means you won't forget useful extensions like -mpopcnt, which lets std::bitset::count() inline a popcnt instruction, or BMI2 (-mbmi2), which makes all variable-count shifts more efficient with shlx / shrx (1 uop vs. 3).
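As a small illustration of the popcnt point, here is a sketch (the helper name count_set_bits is made up for this example). Built with -march=haswell or -mpopcnt, GCC and clang can typically compile the count() call down to a single popcnt instruction; without it they have to use a slower fallback sequence. The result is the same either way, only the generated code differs.

```cpp
#include <bitset>
#include <cstdint>

// Hypothetical helper: counts the set bits in a 64-bit value.
// With -mpopcnt (implied by -march=haswell) the compiler is free to
// implement count() with the popcnt instruction.
inline unsigned count_set_bits(std::uint64_t x) {
    return static_cast<unsigned>(std::bitset<64>(x).count());
}
```

Comparing the asm for this function with and without -mpopcnt on Godbolt is a quick way to see what the flag changes.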

MSVC and ICC do not require this: they let you use intrinsics to emit instructions from extensions that they aren't allowed to use when auto-vectorizing.

You should definitely enable AVX if you use AVX intrinsics. I think I've read / seen that without that, MSVC won't always use vzeroupper where it should.

For compilers that support GNU extensions (GCC, clang, ICC), you can use stuff like __attribute__((target("avx"))) on specific functions in a compilation unit. Or better, __attribute__((target("arch=haswell"))) to also set tuning options. (But that also enables AVX2 and FMA, which you might not want. I'm not sure if target attributes can set -mtune=xx)

https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes (and also

__attribute__((target())) prevents a function from inlining into functions with different target options, so if the function itself is too small to be worth a call, be careful to put the attribute on the functions it should inline into as well.
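For the MyFunSSE() / MyFunAVX() situation from the question, a minimal sketch for GCC and clang could look like the following. All names here (hsum128, sum_sse2, sum_avx, sum_dispatch) are invented for the example, it assumes an x86-64 build where SSE2 is baseline, and the runtime check uses the GNU builtin __builtin_cpu_supports, which is not available on MSVC.

```cpp
#include <immintrin.h>
#include <cstddef>

// Small helper with no target attribute: SSE is baseline on x86-64,
// so this can still inline into the AVX function below.
static inline float hsum128(__m128 v) {
    __m128 hi = _mm_movehl_ps(v, v);          // [v2, v3, v2, v3]
    __m128 s  = _mm_add_ps(v, hi);            // [v0+v2, v1+v3, ...]
    __m128 s1 = _mm_shuffle_ps(s, s, 0x55);   // broadcast element 1
    return _mm_cvtss_f32(_mm_add_ss(s, s1));  // (v0+v2) + (v1+v3)
}

// Baseline version: only SSE intrinsics, no attribute needed.
float sum_sse2(const float* p, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
    float total = hsum128(acc);
    for (; i < n; ++i) total += p[i];         // scalar tail
    return total;
}

// AVX version: the attribute enables AVX for this function only,
// even if the file is compiled without -mavx.
__attribute__((target("avx")))
float sum_avx(const float* p, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    float total = hsum128(_mm_add_ps(lo, hi));
    for (; i < n; ++i) total += p[i];         // scalar tail
    return total;
}

// Pick a version at run time so the AVX code never runs on a CPU without AVX.
float sum_dispatch(const float* p, std::size_t n) {
    return __builtin_cpu_supports("avx") ? sum_avx(p, n) : sum_sse2(p, n);
}
```

The dispatcher is the only caller of sum_avx, and it calls it through a normal function call, so the inlining restriction described above doesn't get in the way here.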

See also https://gcc.gnu.org/wiki/FunctionMultiVersioning for using different target options on multiple definitions of the same function name, for compiler supported runtime dispatching. But I don't think there's a portable (to MSVC) way to do that.

With MSVC you don't need anything, although like I said I think it's normally a bad idea to use AVX intrinsics without -arch:AVX, so you might be better off putting those in a separate file. But for AVX vs. AVX2 + FMA, or SSE2 vs. SSE4.2, you're fine without anything.

On MSVC, just #define AVX2_FUNCTION as empty instead of as __attribute__((target("avx2,fma"))).

e.g.

#include <immintrin.h>

#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
// apparently ICC doesn't support target attributes
#define TARGET_HASWELL __attribute__((target("arch=haswell")))
#else
#define TARGET_HASWELL   // empty
 // maybe warn if __AVX__ isn't defined for functions where this is used?
 // if you need to make sure MSVC uses vzeroupper everywhere needed.
#endif


TARGET_HASWELL
void foo_avx(float *__restrict dst, float *__restrict src) {
    __m256 v = _mm256_loadu_ps(src);
    ...
    ...
}

With GCC and clang, the macro expands to the __attribute__((target)) stuff; with MSVC and ICC it doesn't.
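The "maybe warn" idea from the comments in that macro could be sketched roughly like this. The macro name TARGET_AVX and the function add_avx are placeholders for this example, and the MSVC branch assumes that checking __AVX__ (which MSVC defines under /arch:AVX and higher) is a good enough proxy for "vzeroupper will be handled properly".

```cpp
#include <immintrin.h>
#include <cstddef>

#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
  // GCC / clang: enable AVX for the tagged function only.
  #define TARGET_AVX __attribute__((target("avx")))
#else
  // MSVC (and ICC) accept the intrinsics without any attribute.
  #define TARGET_AVX
  #if defined(_MSC_VER) && !defined(__AVX__)
    // Build-time reminder only; it does not stop compilation.
    #pragma message("AVX intrinsics compiled without /arch:AVX")
  #endif
#endif

// Illustrative AVX function: dst[i] = a[i] + b[i], 8 floats per iteration.
TARGET_AVX
void add_avx(float* __restrict dst, const float* __restrict a,
             const float* __restrict b, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) dst[i] = a[i] + b[i];  // scalar tail
}
```

Callers still have to make sure add_avx only runs on CPUs that support AVX, for example behind a runtime CPU feature check.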

ICC pragma:

https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-optimization-parameter documents a pragma which you'd want to put before AVX functions to make sure vzeroupper is used properly in functions that use _mm256 intrinsics.

#pragma intel optimization_parameter target_arch=AVX

For ICC, you could #define TARGET_AVX as this, and always use it on a line by itself before the function, where you can put either an __attribute__ or a pragma. (Inside a macro that means the _Pragma("...") operator form, since a #define can't expand to a #pragma directive.) You might also want separate macros for defining vs. declaring functions, if ICC doesn't want this on declarations. And a macro to end a block of AVX functions, if you want to have non-AVX functions after them. (For non-ICC compilers, this would be empty.)
