2
votes

I would like to understand how to compute FMA performance. If we look into the description here:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_fmadd_ps&expand=2520,2520&techs=FMA

for the Skylake architecture the instruction has Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.

So, as far as I understand, if the max (turbo) clock frequency is 3 GHz, then on a single core I can execute 1 500 000 000 instructions in one second.

Is that right? If so, what could be the reason that I am observing slightly higher performance?


2 Answers

5
votes

A throughput of 0.5 means that the processor can execute two independent FMAs per cycle. So at 3 GHz, the maximum FMA throughput is 6 billion FMAs per second. You said you are only able to achieve a throughput that is slightly higher than 1.5 billion. This can happen due to one or more of the following reasons:

  • The frontend is delivering fewer than 2 FMA uops every single cycle due to a frontend bottleneck (the DSB path or the MITE path).
  • There are data dependencies between the FMAs or with other instructions (that are perhaps part of the looping mechanics). This can be stated alternatively as follows: fewer than 2 FMAs are ready in the RS every single cycle. Latency comes into play when there are dependencies.
  • Some of the FMAs use memory operands; if those operands are not found in the L1D cache when they are needed, a throughput of 2 FMAs per cycle cannot be sustained (see the sketch after this list).
  • The core frequency becomes less than 3GHz during the experiment. This factor only impacts the throughput per second, not per cycle.
  • Other reasons depending on how exactly your loop works and how you are measuring throughput.
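
To make the memory-operand point concrete, here is a small sketch (my own illustration, not code from the question; the function name scaled_sum and the loop shape are assumptions) in which every FMA takes one source operand from memory. If the array is much larger than the 32 KiB L1D cache, load misses, not the two FMA units, limit how many FMAs execute per cycle:

    #include <immintrin.h>

    /* Sketch only: a multiply-accumulate over an array, with one memory source
     * operand per FMA. When a[] does not fit in the 32 KiB L1D, the loop is
     * limited by loads/cache misses rather than by the FMA units.
     * (Tail elements are ignored for brevity.) Compile with e.g. gcc -O2 -mfma. */
    float scaled_sum(const float *a, long n, float scale) {
        __m256 vs   = _mm256_set1_ps(scale);
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        for (long i = 0; i + 16 <= n; i += 16) {
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),     vs, acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8), vs, acc1);
        }
        __m256 s = _mm256_add_ps(acc0, acc1);
        float out[8];
        _mm256_storeu_ps(out, s);
        return out[0] + out[1] + out[2] + out[3] +
               out[4] + out[5] + out[6] + out[7];
    }
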
2
votes

"Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction."

Just working out the units shows the problem: a latency in cycles times a reciprocal throughput in cycles per instruction gives cycles²/instruction, which is strange and I have no interpretation for it.

The throughput listed there is really a reciprocal throughput, in CPI, so 0.5 cycles per instruction, or equivalently 2 instructions per cycle. These two numbers are simply each other's reciprocals; the latency has nothing to do with it.

There is a related calculation that does involve both the latency and the (reciprocal) throughput, namely the latency divided by the reciprocal throughput (equivalently, the latency times the throughput): 4 / 0.5 = 4 * 2 = 8 instructions. This is how many independent instances of the operation can be "in flight" (started but not completed) simultaneously, comparable to the bandwidth-delay product in network theory. This number informs some code design decisions, because it is a lower bound on the amount of instruction-level parallelism the code needs to expose to the CPU in order to fully use the FMA execution units.
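
As a concrete sketch of that lower bound (my own illustration, not from the answer; the kernel name is made up): with a latency of 4 cycles and a throughput of 2 per cycle, a loop needs at least 8 independent FMA dependency chains in flight to keep both FMA units busy, whereas a single chain can issue only one FMA every 4 cycles.

    #include <immintrin.h>

    /* Sketch only: 8 independent accumulators = latency (4 cycles) x throughput
     * (2 FMAs/cycle), the minimum number of in-flight chains needed to saturate
     * both FMA units on Skylake. With a single accumulator the same loop would
     * issue only one FMA every 4 cycles. Compile with e.g. gcc -O2 -mfma. */
    __m256 fma_saturate(__m256 a, __m256 b, long iters) {
        __m256 acc[8];
        for (int k = 0; k < 8; k++)
            acc[k] = _mm256_set1_ps((float)k);      /* independent starting values */
        for (long i = 0; i < iters; i++)
            for (int k = 0; k < 8; k++)             /* 8 independent dependency chains */
                acc[k] = _mm256_fmadd_ps(acc[k], a, b);
        __m256 sum = acc[0];                        /* combine so the work is not dead code */
        for (int k = 1; k < 8; k++)
            sum = _mm256_add_ps(sum, acc[k]);
        return sum;
    }

Timing such a kernel for a known number of iterations and dividing by the elapsed seconds is one way to check whether the measured rate actually approaches 2 FMAs per cycle.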