2
votes

I would like to understand how to compute FMA performance. If we look into the description here:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_fmadd_ps&expand=2520,2520&techs=FMA

for the Skylake architecture the instruction has Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.

So, as far as I understand, if the max (turbo) clock frequency is 3 GHz, then on a single core I can execute 1 500 000 000 instructions in one second.

Is that right? If so, what could be the reason that I am observing slightly higher performance?


2 Answers

5
votes

A throughput of 0.5 means that the processor can execute two independent FMAs per cycle. So at 3 GHz, the maximum FMA throughput is 6 billion FMAs per second. You said you are only able to achieve a throughput that is slightly higher than 1.5 billion. This can happen due to one or more of the following reasons:

  • The frontend is delivering fewer than 2 FMA uops every single cycle due to a frontend bottleneck (the DSB path or the MITE path).
  • There are data dependencies between the FMAs or with other instructions (that are perhaps part of the looping mechanics). This can be stated alternatively as follows: fewer than 2 FMAs are ready in the RS every single cycle. Latency comes into play when there are dependencies.
  • Some of the FMAs use memory operands; if those operands are not found in the L1D cache when they are needed, a throughput of 2 FMAs per cycle cannot be sustained (see the sketch after this list).
  • The core frequency becomes less than 3GHz during the experiment. This factor only impacts the throughput per second, not per cycle.
  • Other reasons depending on how exactly your loop works and how you are measuring throughput.
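
To make the memory-operand point concrete, here is a small sketch (my own illustration, not code from the question; the function name scaled_sum and the loop shape are assumptions) in which every FMA takes one source operand from memory. If the array is much larger than the 32 KiB L1D cache, load misses, not the two FMA units, limit how many FMAs execute per cycle:

    #include <immintrin.h>

    /* Sketch only: a multiply-accumulate over an array, with one memory source
     * operand per FMA. When a[] does not fit in the 32 KiB L1D, the loop is
     * limited by loads/cache misses rather than by the FMA units.
     * (Tail elements are ignored for brevity.) Compile with e.g. gcc -O2 -mfma. */
    float scaled_sum(const float *a, long n, float scale) {
        __m256 vs   = _mm256_set1_ps(scale);
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        for (long i = 0; i + 16 <= n; i += 16) {
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),     vs, acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8), vs, acc1);
        }
        __m256 s = _mm256_add_ps(acc0, acc1);
        float out[8];
        _mm256_storeu_ps(out, s);
        return out[0] + out[1] + out[2] + out[3] +
               out[4] + out[5] + out[6] + out[7];
    }
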
2
votes

"Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction."

Just working out the units shows the problem: a latency in cycles times a reciprocal throughput in cycles per instruction gives cycles²/instruction, which is strange and I have no interpretation for it.

The throughput listed there is really a reciprocal throughput, in CPI, so 0.5 cycles per instruction, or equivalently 2 instructions per cycle. These two numbers are simply each other's reciprocals; the latency has nothing to do with it.

There is a related calculation that does involve both the latency and the (reciprocal) throughput, namely the latency divided by the reciprocal throughput (equivalently, the latency times the throughput): 4 / 0.5 = 4 * 2 = 8 instructions. This is how many independent instances of the operation can be "in flight" (started but not completed) simultaneously, comparable to the bandwidth-delay product in network theory. This number informs some code design decisions, because it is a lower bound on the amount of instruction-level parallelism the code needs to expose to the CPU in order to fully use the FMA execution units.
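
As a concrete sketch of that lower bound (my own illustration, not from the answer; the kernel name is made up): with a latency of 4 cycles and a throughput of 2 per cycle, a loop needs at least 8 independent FMA dependency chains in flight to keep both FMA units busy, whereas a single chain can issue only one FMA every 4 cycles.

    #include <immintrin.h>

    /* Sketch only: 8 independent accumulators = latency (4 cycles) x throughput
     * (2 FMAs/cycle), the minimum number of in-flight chains needed to saturate
     * both FMA units on Skylake. With a single accumulator the same loop would
     * issue only one FMA every 4 cycles. Compile with e.g. gcc -O2 -mfma. */
    __m256 fma_saturate(__m256 a, __m256 b, long iters) {
        __m256 acc[8];
        for (int k = 0; k < 8; k++)
            acc[k] = _mm256_set1_ps((float)k);      /* independent starting values */
        for (long i = 0; i < iters; i++)
            for (int k = 0; k < 8; k++)             /* 8 independent dependency chains */
                acc[k] = _mm256_fmadd_ps(acc[k], a, b);
        __m256 sum = acc[0];                        /* combine so the work is not dead code */
        for (int k = 1; k < 8; k++)
            sum = _mm256_add_ps(sum, acc[k]);
        return sum;
    }

Timing such a kernel for a known number of iterations and dividing by the elapsed seconds is one way to check whether the measured rate actually approaches 2 FMAs per cycle.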