`loop` (Intel ref manual entry) decrements `ecx`/`rcx`, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? `dec`/`jnz` already macro-fuses into a single uop on Sandybridge-family; the only difference is that that pair sets flags.
`loop`'s numbers on various microarchitectures, from Agner Fog's instruction tables:
- K8/K10: 7 m-ops
- Bulldozer-family/Ryzen: 1 m-op (same cost as a macro-fused test-and-branch, or as `jecxz`)
- P4: 4 uops (same as `jecxz`)
- P6 (PII/PIII): 8 uops
- Pentium M, Core2: 11 uops
- Nehalem: 6 uops (11 for `loope`/`loopne`). Throughput = 4c (`loop`) or 7c (`loope`/`loopne`).
- SnB-family: 7 uops (11 for `loope`/`loopne`). Throughput = one per 5 cycles, as much of a bottleneck as keeping your loop counter in memory! `jecxz` is only 2 uops with the same throughput as a regular `jcc`.
- Silvermont: 7 uops
- AMD Jaguar (low-power): 8 uops, 5c throughput
- Via Nano3000: 2 uops
Couldn't the decoders just decode it the same as `lea rcx, [rcx-1]` / `jrcxz`? That would be 3 uops. At least that would be the case with no address-size prefix; otherwise it has to use `ecx` and truncate `RIP` to `EIP` if the jump is taken. Maybe the odd choice of address-size controlling the width of the decrement explains the many uops?
Or better, just decode it as a fused dec-and-branch that doesn't set flags? `dec ecx` / `jnz` on SnB decodes to a single uop (which does set flags).
I know that real code doesn't use it (because it's been slow since at least P5 or something), but AMD decided it was worth it to make it fast for Bulldozer. Probably because it was easy.
Would it be easy for SnB-family uarchs to have a fast `loop`? If so, why don't they? If not, why is it hard? A lot of decoder transistors? Or extra bits in a fused dec-and-branch uop to record that it doesn't set flags? What could those 7 uops be doing? It's a really simple instruction.

What's special about Bulldozer that made a fast `loop` easy / worth it? Or did AMD waste a bunch of transistors on making `loop` fast? If so, presumably someone thought it was a good idea.
If `loop` was fast, it would be perfect for BigInteger arbitrary-precision `adc` loops, to avoid partial-flag stalls / slowdowns (see my comments on my answer), or any other case where you want to loop without touching flags. It also has a minor code-size advantage over `dec`/`jnz`. (And `dec`/`jnz` only macro-fuses on SnB-family.)
On modern CPUs where `dec`/`jnz` is OK in an ADC loop, `loop` would still be nice for ADCX / ADOX loops (to preserve OF).
If `loop` had been fast, compilers would already be using it as a peephole optimization for code-size + speed on CPUs without macro-fusion.
It wouldn't stop me from getting annoyed at all the questions with bad 16-bit code that uses `loop` for every loop, even when they also need another counter inside the loop. But at least it wouldn't be as bad.
Comments:
- […] `LOOP` instruction when optimizing for Bulldozer. – Michael
- […] `loop`, at the asm level, counting down to zero is slightly more efficient, because the decrement will set the zero flag without needing a compare. I still usually write my C loops from 0..n, for readability though. – Peter Cordes