5
votes

I'm trying to understand how I can max out the number of operations I can get on my CPU. I'm writing a simple matrix multiplication program, and I have a Skylake processor. I was looking at the Wikipedia page for the FLOPS information on this architecture, and I'm having difficulty understanding it.

From my understanding, FMA instructions take three floating-point inputs, right? And they let you combine multiplies and adds among those inputs. But what happens when I only add two floats? Does it simply multiply one of them by one? Can I add three floats in one cycle, or will that be split? I saw that Skylake has 32 FLOPs/cycle for single-precision inputs, but what is the meaning of "two 8-wide FMA instructions"?

Thank you in advance for the explanations

1
This question becomes more interesting if you compare Haswell and Skylake. Haswell can only do one AVX add per clock cycle but two FMA operations per clock cycle. This means that you can double your addition throughput by using two FMA operations that multiply by 1.0. OTOH, the latency for FMA is 5 cycles whereas addition is 3 on Haswell, so you have to use 10 parallel accumulators to get the maximum throughput with FMA, whereas you only need 3 with addition. On Skylake, addition and FMA have the same latency and throughput, so there is no reason to use FMA for addition. - Z boson

1 Answer

8
votes

FMA calculates ± a*b ± c in a single operation, with a single rounding error. That's what it does, nothing else. Calculating a + b + c cannot be done using an FMA instruction; you need two dependent ADD operations for that.

Depending on the compiler, you may have to turn on a compiler option to allow use of FMA instructions, because they don't give results identical to a multiply followed by an add. And you may have to rearrange your code in some cases. For example, a*b + c*d + e will be calculated as x = a*b; y = FMA (c, d, x); z = y + e, but e + a*b + c*d will be calculated as x = FMA (a, b, e); z = FMA (c, d, x). The basic butterfly operation of an FFT is conventionally counted as ten floating-point operations (four multiplies and six adds) and can be rewritten as six operations: four FMAs and two others.

"Two 8-wide FMA instructions" means it can perform FMA instructions with two 256 bit vector registers containing 8 floats each, and two of these in the same cycle.