My CPU is Ivy Bridge. Consider the example from Agner Fog's optimizing_assembly manual, chapter 12.6, example 12.10a:
movsd xmm2, [x]
movsd xmm1, [one]
xorps xmm0, xmm0
mov eax, coeff
L1:
movsd xmm3, [eax] ; c[i]
mulsd xmm3, xmm1 ; c[i]*x^i
mulsd xmm1, xmm2 ; x^(i+1)
addsd xmm0, xmm3 ; sum += c[i]*x^i
add eax, 8
cmp eax, coeff_end
jb L1
The front end is not the bottleneck (that is obvious, given the latency of the multiplication).
We have two loop-carried dependencies (I skipped add eax, 8):
1. mulsd xmm1, xmm2: latency of 5 cycles
2. addsd xmm0, xmm3: latency of 3 cycles
By the way, I can't decide: should I sum the latencies (5 + 3 = 8) or take the largest one, i.e. 5 cycles?
I tested it on an array of 10000000 elements. It takes 6.7 cycles per iteration according to perf, and 5.9 cycles per iteration according to Agner's tool.
Please explain: why does it take 6 to 7 cycles per iteration instead of just 5?
mulsd xmm1, xmm2. Next iteration's mul can start before this iteration's addsd finishes, because it doesn't depend on xmm0. The other instructions depend on that loop-carried mul chain, but they fork off a separate dep chain for each iteration (except for the addsd, which is also loop-carried, but lower latency). – Peter Cordes
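To make the point in the comment above concrete, here is a rough latency-only model of the loop in Python. It is only a sketch under idealized assumptions: unlimited execution ports, perfect out-of-order scheduling, and a load latency of 3 cycles that I picked for illustration (it is not taken from the question). The function names are made up for this example. It only shows that the ideal bound is the longest loop-carried chain rather than the sum of the two; it does not try to explain the measured 5.9 to 6.7 cycles.

```python
def simulate(iterations, mul_lat=5, add_lat=3, load_lat=3):
    """Return the cycle at which each iteration's addsd result is ready.

    Idealized dataflow model: an instruction starts as soon as its inputs
    are ready. load_lat is an assumed value, not a measured one.
    """
    xmm1_ready = 0  # x^i, loop-carried through mulsd xmm1, xmm2
    xmm0_ready = 0  # sum, loop-carried through addsd xmm0, xmm3
    eax_ready = 0   # pointer, loop-carried through add eax, 8 (1 cycle)
    finish = []
    for _ in range(iterations):
        load_done = eax_ready + load_lat                   # movsd xmm3, [eax]
        mul3_done = max(load_done, xmm1_ready) + mul_lat   # mulsd xmm3, xmm1
        xmm1_ready = xmm1_ready + mul_lat                  # mulsd xmm1, xmm2
        xmm0_ready = max(xmm0_ready, mul3_done) + add_lat  # addsd xmm0, xmm3
        eax_ready += 1                                     # add eax, 8
        finish.append(xmm0_ready)
    return finish


def cycles_per_iteration(iterations=1000, **latencies):
    """Steady-state cycles per iteration, measured over the second half."""
    finish = simulate(iterations, **latencies)
    half = iterations // 2
    return (finish[-1] - finish[half]) / (iterations - 1 - half)


if __name__ == "__main__":
    print(cycles_per_iteration())                       # 5.0
    print(cycles_per_iteration(mul_lat=3, add_lat=5))   # 5.0
```

With the latencies from the question (5 for mulsd, 3 for addsd) the model settles at 5 cycles per iteration, not 8: each iteration's addsd only waits for its own mulsd xmm3, xmm1, while the next mulsd xmm1, xmm2 is already in flight. Swapping the two latencies also gives 5, because then the addsd chain becomes the longest one.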