4
votes

Where can I find data about "market share" of x86 microarchitectures? What percentage of users of x86-family CPUs have a CPU that supports SSE4.2, AVX, AVX2, etc.?

I'm distributing precompiled binaries for my program, and I would like to know what is the best optimization target, and which SIMD extensions can be reasonably used without runtime checks.

I can find overall Intel vs AMD market share data, but not a breakdown of generations of Intel's and AMD's CPUs. Ideally I'd like breakdown also per OS and per country, but even general global stats for microarchitectures would be better than nothing.

2
Have you considered shipping multiple binaries/DLLs/SOs and figuring out the proper one during the installation? Data like this might not be very easy to find. - Daniel Kamil Kozar
@DanielKamilKozar I have considered this (as well as multiversioned functions), but I'm hoping to eliminate that kind of complexity. - Kornel
Anything newer than SSE2 (baseline for x86-64) without runtime checks is risky if there's no fallback or install-time detection. AVX and BMI1/2 are very far from being baseline, because Intel is still selling Celeron/Pentium chips with VEX prefix decoding disabled (presumably to make use of silicon with defects in 256-bit execution units), but SSE4.2 is getting closer and SSSE3 is a possibility. See "Most recent processor without support of SSSE3 instructions?" and "Mac OSX minumum support sse version". - Peter Cordes
"Do all 64 bit intel architectures support SSSE3/SSE4.1/SSE4.2 instructions?" has a link to the Valve Hardware Survey for Steam clients (currently showing SSE3 as ~100% installed base, but SSSE3 only at 97%), so if you're shipping a PC game that should correlate pretty well with your target audience. For server stuff, you might easily be able to set an SSE4.2 minimum. - Peter Cordes
@PeterCordes That's great info. Please post it as an answer! - Kornel

2 Answers

8
votes

Anything newer than SSE2 (baseline for x86-64) without runtime checks is risky if there's no fallback or install-time detection.

AVX and BMI1/2 are sadly very far from being baseline, because Intel is still selling Celeron/Pentium chips with VEX prefix decoding disabled (presumably to make use of silicon with defects in 256-bit execution units), but SSE4.2 is getting closer, and SSSE3 is a possibility. See "Most recent processor without support of SSSE3 instructions?" and "Mac OSX minumum support sse version".

"Do all 64 bit intel architectures support SSSE3/SSE4.1/SSE4.2 instructions?" has a link to the Valve Hardware Survey for Steam clients (currently showing SSE3 as ~100% installed base, but SSSE3 only at 97%), so if you're shipping a PC game, that should correlate pretty well with your target audience. The breakdowns are a bit weird for some entries, though. For example, fcmov (x87 branchless conditional-move) is reported as having gone down to 97.5%, but every P6-compatible CPU has it. You won't find a CPU with SSE2 but without FCMOV. Perhaps newer versions of Steam aren't testing for it. And perhaps older versions of Steam aren't testing for CMPXCHG16B? So take the numbers with a grain of salt, but they're probably fairly sensible for SSE2/3/SSSE3/SSE4.x, and AVX.

For server stuff, you might easily be able to set an SSE4.2 minimum. Atom/Silvermont support it, and so do AMD's and VIA's low-power architectures, so energy-efficient servers can run it. Ancient mainstream CPUs don't tend to get much use for servers outside of personal home-server use, because they're often slower than a cheaper modern machine that runs cooler.

(Silvermont isn't likely to support AVX soon, even less AVX2 or FMA.)


You don't have to limit yourself to a single binary. You could even let people pick when they download, or your installer could select at install time.

Or you could have a run-time wrapper that picks an executable and dynamic libraries, so you effectively get runtime dispatching while still being able to compile with gcc -O3 -march=haswell or whatever to let the compiler use new instruction sets all over the place (especially beneficial for BMI1/BMI2, e.g. BMI2's efficient single-uop variable-count shifts).

Another option is dynamic linker tricks, either on a whole-library basis or on a per-function basis, like glibc uses to resolve memset to __memset_avx2_unaligned_erms. See "perf report shows this function "__memset_avx2_unaligned_erms" has overhead. does this mean memory is unaligned?"

All of these (except the per-function dynamic linker tricks) are easier than making your code aware of instruction-set extensions at runtime, and have zero performance overhead. (Unless you put stuff in a dynamic library when you wouldn't have otherwise, so it can't inline.)
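For comparison, here is a minimal sketch of what making the code itself aware of extensions can look like. It assumes GCC or Clang on x86 and uses the `__builtin_cpu_supports` builtin with a plain function pointer. This is manual dispatch, not the dynamic-linker (ifunc) mechanism glibc uses, and the `sum_u32*` names are hypothetical, chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the builtins below exist only with GCC/Clang on x86. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Baseline version: compiled with the default target (SSE2 on x86-64). */
static uint64_t sum_u32_baseline(const uint32_t *v, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) total += v[i];
    return total;
}

#if HAVE_X86_DISPATCH
/* Same source, but the compiler may emit AVX2 inside this one function. */
__attribute__((target("avx2")))
static uint64_t sum_u32_avx2(const uint32_t *v, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) total += v[i];
    return total;
}
#endif

typedef uint64_t (*sum_fn)(const uint32_t *, size_t);

/* Choose an implementation based on what the running CPU reports. */
static sum_fn pick_sum_u32(void) {
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_u32_avx2;
#endif
    return sum_u32_baseline;
}

/* Public entry point: resolves the pointer on first call, then reuses it.
   Not synchronized; a threaded program would want to resolve it at startup. */
uint64_t sum_u32(const uint32_t *v, size_t n) {
    static sum_fn impl;
    if (!impl) impl = pick_sum_u32();
    return impl(v, n);
}
```

Every call goes through an indirect call that can't be inlined, which is the kind of overhead the whole-binary approaches avoid.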

1
votes

The simple way to solve this problem (speaking as an ex-games programmer) is to compile a binary for each CPU level you wish to support (e.g. SSE2, SSE4, AVX2). The 'executable' for the game is then just a cpuid check, which runs the correct exe depending on which CPU is detected.
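A launcher along these lines might look like the following sketch. It assumes GCC or Clang (for `__builtin_cpu_supports`, used here in place of a raw cpuid instruction) and a POSIX system (for `execv`). The function names and the `<base>-<level>` file naming scheme are hypothetical, not something the answer specifies.

```c
#include <stdio.h>
#include <unistd.h>   /* execv (POSIX) */

/* Returns the suffix of the most capable build the running CPU can execute.
   The level names mirror the example levels above (SSE2, SSE4, AVX2). */
const char *pick_cpu_level(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))   return "avx2";
    if (__builtin_cpu_supports("sse4.2")) return "sse4";
#endif
    return "sse2";  /* fallback: the baseline build */
}

/* Replaces the current process with "<base>-<level>", passing argv through.
   Returns -1 if the path is too long or the exec fails. */
int launch_best_build(const char *base, char *const argv[]) {
    char path[4096];
    int len = snprintf(path, sizeof path, "%s-%s", base, pick_cpu_level());
    if (len < 0 || (size_t)len >= sizeof path) return -1;
    execv(path, argv);
    return -1;  /* execv only returns on failure */
}
```

The launcher's `main` would call something like `launch_best_build("/opt/mygame/bin/game", argv)` (a made-up path) and report an error if it returns. The launcher itself has to be compiled for the baseline target so that it runs everywhere.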