`loop` (Intel ref manual entry) decrements `ecx`/`rcx`, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? `dec`/`jnz` already macro-fuses into a single uop on Sandybridge-family; the only difference is that that pair sets flags.
`loop`'s numbers on various microarchitectures, from Agner Fog's instruction tables:
- K8/K10: 7 m-ops
- Bulldozer-family/Ryzen: 1 m-op (same cost as a macro-fused test-and-branch, or as `jecxz`)
- P4: 4 uops (same as `jecxz`)
- P6 (PII/PIII): 8 uops
- Pentium M, Core2: 11 uops
- Nehalem: 6 uops (11 for `loope`/`loopne`). Throughput = 4c (`loop`) or 7c (`loope`/`loopne`).
- SnB-family: 7 uops (11 for `loope`/`loopne`). Throughput = one per 5 cycles, as much of a bottleneck as keeping your loop counter in memory! `jecxz` is only 2 uops with the same throughput as a regular `jcc`.
- Silvermont: 7 uops
- AMD Jaguar (low-power): 8 uops, 5c throughput
- Via Nano3000: 2 uops
Couldn't the decoders just decode it the same as `lea rcx, [rcx-1]` / `jrcxz`? That would be 3 uops. At least that would be the case with no address-size prefix; otherwise it has to use `ecx` and truncate `RIP` to `EIP` if the jump is taken. Maybe the odd choice of address-size controlling the width of the decrement explains the many uops?
Or better, just decode it as a fused dec-and-branch that doesn't set flags? `dec ecx` / `jnz` on SnB decodes to a single uop (which does set flags).
I know that real code doesn't use it (because it's been slow since at least P5 or something), but AMD decided it was worth it to make it fast for Bulldozer. Probably because it was easy.
Would it be easy for SnB-family uarchs to have a fast `loop`? If so, why don't they? If not, why is it hard? A lot of decoder transistors? Or extra bits in a fused dec-and-branch uop to record that it doesn't set flags? What could those 7 uops be doing? It's a really simple instruction.

What's special about Bulldozer that made a fast `loop` easy / worth it? Or did AMD waste a bunch of transistors on making `loop` fast? If so, presumably someone thought it was a good idea.
If `loop` was fast, it would be perfect for BigInteger arbitrary-precision `adc` loops, to avoid partial-flag stalls / slowdowns (see my comments on my answer), or any other case where you want to loop without touching flags. It also has a minor code-size advantage over `dec`/`jnz`. (And `dec`/`jnz` only macro-fuses on SnB-family.)
On modern CPUs where `dec`/`jnz` is OK in an ADC loop, `loop` would still be nice for ADCX / ADOX loops (to preserve OF).
If `loop` had been fast, compilers would already be using it as a peephole optimization for code-size + speed on CPUs without macro-fusion.
It wouldn't stop me from getting annoyed at all the questions with bad 16-bit code that uses `loop` for every loop, even when they also need another counter inside the loop. But at least it wouldn't be as bad.
Comments:
- […] `LOOP` instruction when optimizing for Bulldozer. – Michael
- […] `loop`, at the asm level, counting down to zero is slightly more efficient, because the decrement will set the zero flag without needing a compare. I still usually write my C loops from 0..n, for readability though. – Peter Cordes