addiu $6,$6,5
bltz $6,$L5
nop
...
$L5:
How is this safe without stalling, which classic MIPS couldn't even do, except on cache miss? (MIPS originally stood for Microprocessor Without Interlocked Pipeline Stages, and had a load delay slot instead of interlocking.)
Original MIPS I is a classic 5-stage RISC IF ID EX MEM WB design that hides all of its branch latency with a single branch-delay slot by checking branch conditions early, in the ID stage (correction: this was the mistake, go read this answer; don't be misled by the rest of the details in the question based on this false premise). Which is why it's limited to equal/not-equal, or sign-bit checks like lt or ge zero, not lt between two registers that would need carry-propagation through an adder.
Doesn't this mean that branches need their input ready a cycle earlier than ALU instructions? The bltz enters the ID stage in the same cycle that addiu enters EX.
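To make that overlap concrete, here is a minimal Python sketch of an idealized 5-stage pipeline that issues one instruction per cycle and never stalls. The names stage_at and occupancy and the cycle numbering are made up for illustration; nothing here comes from MIPS documentation.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_at(issue_cycle, cycle):
    """Return the stage an instruction occupies in `cycle`, or None.

    Assumes an idealized pipeline: the instruction enters IF in
    `issue_cycle` and advances exactly one stage per cycle (no stalls).
    """
    idx = cycle - issue_cycle
    return STAGES[idx] if 0 <= idx < len(STAGES) else None

# Hypothetical issue cycles for the two instructions from the example.
program = [("addiu $6,$6,5", 0), ("bltz $6,$L5", 1)]

def occupancy(cycle):
    """Map each instruction to the stage it is in during `cycle`."""
    return {text: stage_at(issue, cycle) for text, issue in program}

for cycle in range(6):
    print(cycle, occupancy(cycle))
```

In cycle 2 of this model, addiu is in EX while bltz is in ID, which is the situation the question is asking about.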
MIPS I (aka R2000) uses bypass forwarding from EX-output to EX-input so normal integer ALU instructions (like a chain of addu/xor) have single-cycle latency and can run in consecutive cycles.
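The concern can be sketched as a whole-cycle timing check. This is only a toy model under stated assumptions: an ALU result exists at the end of the producer's EX cycle, a consumer can use it only in a stage that starts in a later cycle, and half-cycle tricks are ignored. The function name forwarded_in_time is hypothetical.

```python
STAGE_INDEX = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3, "WB": 4}

def forwarded_in_time(producer_issue, consumer_issue, consumer_stage):
    """True if a forwarded ALU result is ready when the consumer needs it.

    Assumptions (whole-cycle model, no half-cycle tricks):
    - the result is produced at the end of the producer's EX cycle;
    - the consumer needs it at the start of `consumer_stage`.
    """
    produced_end_of = producer_issue + STAGE_INDEX["EX"]
    needed_at_start_of = consumer_issue + STAGE_INDEX[consumer_stage]
    return needed_at_start_of > produced_end_of

# Back-to-back instructions: producer issued in cycle 0, consumer in cycle 1.
print(forwarded_in_time(0, 1, "EX"))  # ALU consumer reading its input in EX
print(forwarded_in_time(0, 1, "ID"))  # branch that (by the premise) reads in ID
print(forwarded_in_time(0, 2, "ID"))  # same branch with one instruction between
```

In this model the back-to-back ALU consumer gets its input in time, while a branch that reads its input in ID would need one independent instruction in between.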
MIPS stands for "Microprocessor without Interlocked Pipeline Stages", so it doesn't detect RAW hazards; code has to avoid them. (Hence load-delay slots on first-gen MIPS, with MIPS II adding interlocks to stall in that case, invalidating the acronym :P).
But I never see any discussion of calculating the branch condition multiple instructions ahead to avoid a stall. (The addiu/bltz example was emitted by MIPS gcc5.4 -O3 -march=mips1 on Godbolt, which does respect load-delay slots, filling with nop if needed.)
Does it use some kind of trick like EX reading inputs on the falling edge of the clock, and ID not needing forwarded register values until the rising edge? (With EX producing its results early enough for that to work)
I guess that would make sense if the clock speed is capped low enough for cache access to be single-cycle.
"Stalling or bubble in MIPS" claims that lw + a beq on the load result needs 2 stall cycles because it can't forward. That's not accurate for actual MIPS I (unless gcc is buggy). It does mention half clock cycles, though, allowing a value to be written and then read from the register file in the same whole cycle.
lw $6, ($6)
nop
bltz $6, $L5
because the one instruction load delay slot is not enough. – Ross Ridge
bltz enters ID along with addiu entering EX, they have a whole clock to stabilise their output and write the result in the interstage latches/register. So EX simply forwards the registers while ID initially uses the old value, but the new one arrives in time for its value to propagate through the ID condition-checking gates. Basically, like you said with the falling/rising edge, though this may actually be a combinatorial (not clock-based) net and not a sequential one (which would make it a "pipelined" ID stage). – Margaret Bloom