From Agner Fog's "Optimizing Assembly" guide, Section 12.7: a loop example. One of the paragraphs discussing the example code:
[...] Analysis for Pentium M: ... 13 uops at 3 per clock = one iteration per 4.33c retirement time.
There is a dependency chain in the loop. The latencies are: 2 for memory read, 5 for multiplication, 3 for subtraction, and 3 for memory write, which totals 13 clock cycles. This is three times as much as the retirement time but it is not a loop-carried dependence because the results from each iteration are saved to memory and not reused in the next iteration. The out-of-order execution mechanism and pipelining makes it possible that each calculation can start before the preceding calculation is finished. The only loop-carried dependency chain is
add eax,16
which has a latency of only 1.
## Example 12.6b. DAXPY algorithm, 32-bit mode
[...] ; not shown: initialize some regs before the loop
L1:
movapd xmm1, [esi+eax] ; X[i], X[i+1]
mulpd xmm1, xmm2 ; X[i] * DA, X[i+1] * DA
movapd xmm0, [edi+eax] ; Y[i], Y[i+1]
subpd xmm0, xmm1 ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
movapd [edi+eax], xmm0 ; Store result
add eax, 16 ; Add size of two elements to index
cmp eax, ecx ; Compare with n*8
jl L1 ; Loop back
I cannot understand why the dependency chain doesn't increase a whole throughput. I know that it is only important to find the worst bottleneck. The worst bottleneck identified before considering dependency chains was fused-domain uop throughput, at 4.33 cycles per iteration. I cannot understand why the dependency chain isn't a bigger bottleneck than that.
I see that the author explains that it is connected with out-of-order execution and pipelining but I cannot see it. I mean, though, only multiplication causes latency 5 cycles so only this value is greater than 4 cycle.
I also cannot understand why the author doesn't care about the dependency here:
add eax, 16 -> cmp eax, ecx -> jl L1
After all, addition must be executed beforecmp
andcmp
must be executed beforejl
.
PS: later paragraphs identify the biggest bottleneck for Pentium M as decode, limiting it to one iteration per 6c, because 128b vector ops decode to two uops each. See Agner Fog's guide for the rest of the analysis, and analysis + tuning for Core2, FMA4 Bulldozer, and Sandybridge.