How do I enable SSE4.1 and SSE3 (but NOT AVX) in MSVC

Question

I am trying to enable support for different SIMD instruction sets using MSVC.

There is a documentation page about enabling some SIMD extensions, such as SSE2, AVX and AVX2: https://docs.microsoft.com/en-us/cpp/build/reference/arch-x86?redirectedfrom=MSDN&view=vs-2019

However, it does not mention how to enable other SIMD extensions, e.g., SSE3, SSE4.1 and SSE4.2. Is it possible to enable these without enabling AVX?

Also, it looks like in MSVC 2017 /arch:SSE2 is no longer supported or needed. Can I assume that SSE3/SSE4.1/SSE4.2 are enabled by default as well?

can I assume that SSE3/SSE4.1/SSE4.2 are enabled by default as well? - No, SSE2 is baseline for x86-64. Every x86-64 CPU is guaranteed to have SSE2. I assume that's why you don't need an option for it. But there are some AMD x86-64 CPUs without SSE3, and some Intel x86-64 CPUs without SSE4.1 (e.g. first-gen Core 2). — Peter Cordes
I don't know the answer to your question, though. You might only get SSE4 without AVX via intrinsics because MSVC is bad at this (or designed around a runtime-dispatch model, not compile-time), but maybe there's an MSVC option. You could use a compiler like clang where you can use -O3 -msse4.1 or -O3 -march=penryn — Peter Cordes

Soonts · Accepted Answer · 2020-09-25T04:40:28

The VC++ compiler is less smart than you think. Here’s how these settings work.

When you’re building 32-bit code and enable SSE1 or SSE2 (/arch:SSE or /arch:SSE2), it enables automatic vectorization into the respective instruction set.

When you’re building 64-bit code, both SSE1 and SSE2 are part of the base instruction set: every AMD64 processor in the world is required to support them. That’s why you’re getting the warning with /arch:SSE2.

When you set /arch:AVX, the compiler does two things. It enables automatic vectorization into AVX1, and it switches the instruction encoding from legacy to VEX for all vector instructions (SSE and AVX, manually vectorized and auto-vectorized alike). VEX is good stuff: it allows unaligned memory reads to be fused into other instructions as operands (legacy SSE instructions require aligned memory operands). It also avoids dependency issues that can hurt performance: VEX-encoded vaddps xmm0, xmm0, xmm1 zeroes the upper 16 bytes of ymm0, while legacy-encoded addps xmm0, xmm1 leaves whatever data was there.

When you set /arch:AVX2, the compiler does a few minor optimizations; most notably, intrinsics like _mm_set1_epi32 may compile into vpbroadcastd. It also switches the encoding to VEX, as with AVX1.
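To make the _mm_set1_epi32 point concrete, here is a minimal sketch, assuming an x86-64 target where the SSE2 header emmintrin.h is always usable. The helper names splat and lane are hypothetical. The source stays the same under every /arch setting; only the instruction sequence MSVC picks for the broadcast may change.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics, baseline on x86-64
#include <cassert>
#include <cstdint>

// Hypothetical helper: fill all four 32-bit lanes with the same value.
// With /arch:AVX2, MSVC may emit vpbroadcastd here; without it, a
// shuffle-based sequence (e.g. movd + pshufd) is typical. Same source either way.
__m128i splat(int32_t value) {
    return _mm_set1_epi32(value);
}

// Hypothetical helper: store the vector to memory and return lane i.
int32_t lane(__m128i v, int i) {
    assert(i >= 0 && i < 4);
    alignas(16) int32_t out[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(out), v);
    return out[i];
}
```

Compiling this file with and without /arch:AVX2 and comparing the assembly listings (/FA) is a simple way to see the difference.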

Note the emphasis on the word automatic. The Microsoft compiler doesn’t do runtime dispatch or CPUID checks, and the automatic vectorizer doesn’t use SSE3 or SSE4.1. If you’re writing manually vectorized code, the compiler won’t generate fallbacks; it will emit whatever instructions you asked for. For such code, the AVX/AVX2 setting only affects how those instructions are encoded.
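As an illustration of the manual route, here is a sketch (assuming x86/x64) of an element-wise multiply using _mm_mullo_epi32, an SSE4.1 intrinsic from smmintrin.h. Under MSVC it compiles with no /arch switch, and the SSE4.1 instruction is emitted unconditionally. The names mul_i32 and TARGET_SSE41 are hypothetical; the macro exists only so the same file also builds with GCC or Clang, which, unlike MSVC, refuse to inline SSE4.1 intrinsics unless SSE4.1 is enabled for the function.

```cpp
#include <smmintrin.h>  // SSE4.1 intrinsics (_mm_mullo_epi32)
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical portability macro: MSVC needs nothing here; GCC/Clang need
// SSE4.1 enabled for this one function.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

// Multiply two int32 arrays element-wise, four lanes at a time.
// n must be a multiple of 4. No fallback is generated: the caller is
// responsible for making sure the CPU supports SSE4.1.
TARGET_SSE41
void mul_i32(const int32_t* a, const int32_t* b, int32_t* out, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        __m128i prod = _mm_mullo_epi32(va, vb);  // SSE4.1 (pmulld)
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), prod);
    }
}
```

Because nothing in the binary checks for SSE4.1, calling mul_i32 on a CPU without it (e.g. a first-generation Core 2) would crash with an illegal-instruction fault.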

If you want to write manually vectorized code that uses SSE3, SSSE3, SSE4.1, FMA3, AES, SHA, etc., you don’t need to enable anything. You just need to include the relevant headers and, ideally, check at runtime that the CPU supports them. For the latter, I usually call __cpuid early on startup and check these bits, so that users with an unsupported CPU get a comprehensible error message instead of a hard crash later.
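A minimal sketch of such a startup check, assuming C++ on x86/x64. It reads ECX from CPUID leaf 1, where SSE3 is bit 0, SSSE3 bit 9, SSE4.1 bit 19 and SSE4.2 bit 20. __cpuid from intrin.h is the MSVC intrinsic; the __get_cpuid branch (GCC/Clang cpuid.h) is there only so the sketch builds elsewhere. All function and constant names are hypothetical.

```cpp
#include <cstdio>
#include <cstdlib>
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid (MSVC)
#else
#include <cpuid.h>    // __get_cpuid (GCC/Clang)
#endif

// Feature bits in ECX of CPUID leaf 1.
constexpr unsigned kSse3  = 1u << 0;
constexpr unsigned kSsse3 = 1u << 9;
constexpr unsigned kSse41 = 1u << 19;
constexpr unsigned kSse42 = 1u << 20;

// Return ECX from CPUID leaf 1 (0 if the leaf cannot be queried).
unsigned cpuid_leaf1_ecx() {
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};  // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return static_cast<unsigned>(regs[2]);
#else
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return 0;
    }
    return ecx;
#endif
}

// Bits of `required` that are not set in `ecx`; 0 means all are present.
unsigned missing_features(unsigned ecx, unsigned required) {
    return required & ~ecx;
}

// Call once at startup, before any code that uses these instruction sets,
// so an old CPU gets a readable message instead of an illegal-instruction crash.
void require_cpu_features_or_exit(unsigned required) {
    const unsigned missing = missing_features(cpuid_leaf1_ecx(), required);
    if (missing == 0) {
        return;
    }
    std::fprintf(stderr, "Unsupported CPU, missing:%s%s%s%s\n",
                 (missing & kSse3)  ? " SSE3"   : "",
                 (missing & kSsse3) ? " SSSE3"  : "",
                 (missing & kSse41) ? " SSE4.1" : "",
                 (missing & kSse42) ? " SSE4.2" : "");
    std::exit(EXIT_FAILURE);
}
```

For example, require_cpu_features_or_exit(kSse3 | kSsse3 | kSse41) at the top of main. Note that FMA3 (and AVX) additionally need the OS to have enabled the YMM register state (OSXSAVE plus XGETBV), so a bare CPUID bit check is not enough for those.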
