7
votes

I've been investigating the benefits of SIMD algorithms in C# and C++, and found that in many cases using 128-bit registers on an AVX processor offers a better improvement than using 256-bit registers on a processor with AVX2, but I don't understand why.

By improvement I mean the speed-up of a SIMD algorithm relative to a non-SIMD algorithm on the same machine.
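
As a rough illustration of that definition, the sketch below times a callable and divides the scalar run time by the SIMD run time. The names time_seconds and speedup are made up for this example, and a real benchmark would also need warm-up runs and repeated measurements.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: wall-clock time of one call to f, in seconds.
template <typename F>
double time_seconds(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speed-up of the SIMD version relative to the non-SIMD version,
// both measured on the same machine. A value above 1 means SIMD is faster.
inline double speedup(double scalar_seconds, double simd_seconds) {
    assert(scalar_seconds > 0.0 && simd_seconds > 0.0);
    return scalar_seconds / simd_seconds;
}
```

Usage would look like speedup(time_seconds(scalar_version), time_seconds(simd_version)), where scalar_version and simd_version are placeholders for the two implementations being compared.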

2
@eoinmullan You seem to be testing things on different machines. Saying that you get a 2x speedup on Ivy Bridge with AVX doesn't mean you will get more than 2x on Haswell with AVX2. This is definitely the case if the machines have different amounts of memory bandwidth. You need to do what Ben said: run all the tests on the same machine. Otherwise you're comparing apples to oranges. – Mysticial
What is your memory bus width? How many banks? Are they the same on both machines? – stark
You still need to compare on the same machine, since different machines (and different models of processors) behave differently with regard to memory bandwidth, cache sizes, memory speed, cache speed, etc. If you get better speed with AVX than with AVX2 on the same machine, then it's possibly a sign that something isn't quite right with the compilation, but just comparing two different machines with a whole range of different properties will not show that. – Mats Petersson
That's exactly what I'd expect to see, assuming your benchmark is memory-bound. If your Ivy Bridge machines have more memory bandwidth than the Haswell ones, then it's totally expected to see the scaling be higher on Ivy Bridge than on Haswell. If that's the case, then no surprise here. – Mysticial
@LưuVĩnhPhúc Yeah, but RyuJIT only uses 128 bits on AVX, and _mm256_add_epi16 is an invalid instruction on my AVX processor. From the Intel Intrinsics Guide, it looks like only double and float operations are available on 256-bit registers with AVX. – eoinmullan

2 Answers

13
votes

On an AVX processor, the upper half of the 256-bit registers and floating-point units is powered down by the CPU when it is not executing AVX instructions (VEX-encoded opcodes). When code does use AVX instructions, the CPU has to power up the FP units. This takes about 70 microseconds, during which time AVX instructions are actually executed by running 128-bit micro-ops twice.

When AVX instructions haven't been used for about 700 microseconds, the CPU powers down the upper half of the circuitry again.

Now it does this because the upper half of the circuitry consumes power (doh!), and so generates heat (double doh!). This means that the CPU runs hotter when AVX instructions are used. So given that CPUs can "turbo boost" when they have thermal headroom, using AVX instructions reduces this chance, and in fact, the CPU actually reduces the "base clock speed". So if you have, for example, a CPU officially clocked at 2.3GHz that can turbo boost to 2.7, when you start using AVX instructions, the chip is clocked down to 2.1 and boosted to only 2.3, and in extreme cases the base clock may be reduced to 1.9 (see pages 2-4 of this).

At this stage, your CPU is executing ALL instructions about 10-15%, maybe even 20% SLOWER than when not using AVX instructions. If you're doing loads of SIMD operations, the 256 bit wide instructions make this worthwhile. But if you're doing a few AVX instructions, then "normal" code, then a bit of AVX again, then this clock speed penalty will cost more than all the gains you can make from AVX alone.

This can be why 128 bit wide SIMD can run faster than 256 bit wide unless you've got lengthy intensive bursts of SIMD-dominated operations. There is a price to using the rest of the silicon... (or perhaps more accurately, a reward for not using it that we sometimes forget we've been getting).
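
The trade-off described above can be put into a back-of-the-envelope model. This is only a sketch under loud assumptions: relative_time is a made-up name, the reduced clock is assumed to apply to the whole run, the power-up period is ignored, and the speed-up and clock figures in the usage note are illustrative rather than measured.

```cpp
#include <cassert>

// Run time relative to an all-scalar run at full clock (which is 1.0).
//   simd_fraction: share of the scalar run time that can be vectorised (0..1)
//   simd_speedup:  how much faster that share runs once vectorised (assumed)
//   clock_factor:  clock while using this SIMD width, relative to full clock
inline double relative_time(double simd_fraction, double simd_speedup,
                            double clock_factor) {
    assert(simd_fraction >= 0.0 && simd_fraction <= 1.0);
    assert(simd_speedup > 0.0 && clock_factor > 0.0);
    // Vectorised share shrinks; the "normal" code keeps its original cost.
    const double work = simd_fraction / simd_speedup + (1.0 - simd_fraction);
    // A lower clock slows down ALL instructions, not just the SIMD ones.
    return work / clock_factor;
}
```

Assuming 2x for 128-bit SIMD at full clock and 4x for 256-bit SIMD at 85% clock: when only 10% of the run is SIMD work, relative_time(0.1, 2.0, 1.0) is 0.95 while relative_time(0.1, 4.0, 0.85) is about 1.09, so the wider version ends up slower than plain scalar code. When 90% of the run is SIMD work, the same calls give 0.55 against about 0.38, and the 256-bit version wins.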

3
votes

(From the comments on the question)

If arithmetic operations are not the bottleneck in an algorithm's execution, then using SIMD will not provide a speed-up. Other bottlenecks could be memory bandwidth, cache sizes, memory speed, or cache speed. If a processor with AVX out-performs an AVX2 processor in these areas, then it will benefit more from using SIMD intrinsics.
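
A crude way to see this is to estimate a loop's run time as the slower of "do the arithmetic" and "move the data". The sketch below does that; the function names, parameters, and the max-of-two-times model are assumptions made for illustration, not a description of how any particular CPU behaves.

```cpp
#include <algorithm>
#include <cassert>

// Estimated run time: limited by whichever of memory traffic or
// arithmetic takes longer (a deliberately simplistic model).
inline double estimated_seconds(double bytes, double bytes_per_second,
                                double ops, double ops_per_second) {
    assert(bytes_per_second > 0.0 && ops_per_second > 0.0);
    const double memory_time = bytes / bytes_per_second;
    const double compute_time = ops / ops_per_second;
    return std::max(memory_time, compute_time);
}

// Estimated speed-up when SIMD multiplies arithmetic throughput by
// simd_gain but leaves memory bandwidth unchanged.
inline double estimated_speedup(double bytes, double bytes_per_second,
                                double ops, double scalar_ops_per_second,
                                double simd_gain) {
    assert(simd_gain > 0.0);
    const double scalar = estimated_seconds(bytes, bytes_per_second,
                                            ops, scalar_ops_per_second);
    const double simd = estimated_seconds(bytes, bytes_per_second,
                                          ops, scalar_ops_per_second * simd_gain);
    return scalar / simd;
}
```

With made-up figures of 5e9 bytes, 1e9 operations, 1e9 scalar operations per second and a 4x SIMD gain, a machine with 1e10 bytes/s of bandwidth is estimated to gain only about 2x, while one with 4e10 bytes/s gets the full 4x. In this model the machine with more memory bandwidth shows the larger SIMD speed-up, whatever the register width.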