6
votes
        addiu   $6,$6,5
        bltz    $6,$L5
        nop
        ...
$L5:

How is this safe without stalling, which classic MIPS couldn't even do, except on cache miss? (MIPS originally stood for Microprocessor Without Interlocked Pipeline Stages, and had a load delay slot instead of interlocking.)

Original MIPS I is a classic 5-stage RISC IF ID EX MEM WB design that hides all of its branch latency with a single branch-delay slot by checking branch conditions early, in the ID stage (correction: this was the mistake, go read this answer; don't be misled by the rest of the details in the question based on this false premise). Which is why it's limited to equal/not-equal, or sign-bit checks like lt or ge zero, not lt between two registers that would need carry-propagation through an adder.

Doesn't this mean that branches need their input ready a cycle earlier than ALU instructions? The bltz enters the ID stage in the same cycle that addiu enters EX.

MIPS I (aka R2000) uses bypass forwarding from EX-output to EX-input so normal integer ALU instructions (like a chain of addu/xor) have single-cycle latency and can run in consecutive cycles.


MIPS stands for "Microprocessor without Interlocked Pipeline Stages", so it doesn't detect RAW hazards; code has to avoid them. (Hence load-delay slots on first-gen MIPS, with MIPS II adding interlocks to stall in that case, invalidating the acronym :P).

But I never see any discussion of calculating the branch condition multiple instructions ahead to avoid a stall. (The addiu/bltz example was emitted by MIPS gcc5.4 -O3 -march=mips1 on Godbolt, which does respect load-delay slots, filling with nop if needed.)


Does it use some kind of trick like EX reading inputs on the falling edge of the clock, and ID not needing forwarded register values until the rising edge? (With EX producing its results early enough for that to work)

I guess that would make sense if the clock speed is capped low enough for cache access to be single-cycle.

Stalling or bubble in MIPS claims that lw + a beq on the load result needs 2 stall cycles because it can't forward. That's not accurate for actual MIPS I (unless gcc is buggy). It does mention half clock cycles, though, allowing a value to be written and then read from the register file in the same whole cycle.

2
I seem to recall seeing a diagram of actual MIPS propagation-delay timings for parts of various stages sometime in the last few months / half a year. I think it did have the EX result ready early and have ID not need it until the 2nd phase of the clock. But I don't remember where I saw that, or if it was actually for MIPS instead of some other ISA. - Peter Cordes
I'm pretty sure the CPU just stalls (inserts a bubble) and that the "without Interlocked Pipeline Stages" was never true for any commercially released MIPS processor. It's hard to be sure because pretty all I can find on the MIPS pipeline are course slides that might not be talking about a real CPU. Note that stall would also be required with lw $6, ($6) nop bltz $6, $L5 because the one instruction load delay slot is not enough. - Ross Ridge
Part of the confusion here maybe the result of the fact that the MIPS I architecture wasn't the first MIPS architecture, before it came the Stanford MIPS architecture. This original architecture was the one that didn't have interlocks. It also didn't have byte addressing. ethz.ch/content/dam/ethz/special-interest/infk/inst-cs/lst-dam/… - Ross Ridge
@RossRidge: Unfortunately we can't assume that gcc makes optimal code, but its instruction scheduling in an unrolled loop (godbolt.org/z/WLdSCz) doesn't avoid computing branch inputs right before testing them (which it could and should if that leads to a stall, for performance not correctness reasons). Although from playing around with variations on the loop, it does often avoid that when there's a bit more work in the loop. So we just can't tell if it's a missed opt or if it's actually fine on MIPS I. (And GCC's MIPS tuning cares some about superscalar MIPS.) - Peter Cordes
As I see it, when bltz enters ID along with addiu entering EX, they have a whole clock to stabilise their output and write the result in the interstage latches/register. So EX simply forward the registers while ID initially uses the old value but the new one arrives in time for its value to propagates through the ID conditions checking gates. Basically, like you said with the falling/rising edge though this may actually be a combinatoric (not clock based) and not a sequential net (which would make it a "pipelined" ID stage). - Margaret Bloom

2 Answers

3
votes

TL:DR: Classic MIPS I checks branch conditions in the first half cycle of EX, so forwarding to them is not special.

IF only needs the address in the 2nd half of a cycle so EX can forward to it.

These factors combine to give only 1 cycle of branch latency (hidden by 1 delay slot), with no problem for branches that depend the previous ALU instruction.


It was definitely safe to run sltu / beq on MIPS I (R2000). That's listed as the expansion for the bgeu pseudo-instruction, for example, in real MIPS manuals and books with no caveat about it being unsafe on MIPS R2000 or any other MIPS.

GCC uses sequences like that in practice even with march=mips1 which respects load-delay slots and other features of real MIPS R2000.


MIPS's IF doesn't need an address until the 2nd half of a clock cycle, allowing EX to produce it quickly enough.

From See MIPS Run by Dominic Sweetman, (covering MIPS I through MIPS IV), Chapter 1.5.1 Constraints on Instructions

We’ll see later that efficient conditional branching means that the decision about whether to branch or not has to be squeezed into only half a pipeline stage; the architecture helps out by keeping the branch decision tests very simple. So conditional branches (in MIPS) test a single register for sign/zero or a pair of registers for equality.

Their Figure 1.3: The pipeline and branch delays shows the branch condition being calculated in the first half of EX, and used in the 2nd half of IF, for a total branch latency of only 1 cycle / pipeline stage (ID) / instruction. IF doesn't actually start until the 2nd half of a clock cycle. (And continues into ID. The actual decode/register-fetch of ID only takes the last fraction of a clock cycle.)

That has the same end result as what I suggested in the question (check branch condition by the end of ID), except it only requires EX -> EX forwarding to branch on the result of the previous ALU instruction.

Perhaps I was misremembering or misinterpreting something I'd read previously about the half-cycle branch-decision. This half-cycle thing might well be exactly what I remembered seeing.

Further quoting See MIPS Run 1.5.5 Programmer-Visible Pipeline Effects

• Delayed branches: [first paragraph explains the branch-delay slot]

If nothing special was done by the hardware, the decision to branch or not, together with the branch target address, would emerge at the end of the ALU pipestage — in time to fetch the branch target instruction instead of the next instruction but two. But branches are important enough to justify special treatment, and you can see from Figure 1.3 [described above] that a special path is provided through the ALU to make the branch address available half a clock cycle early. Together with the odd half-clock-cycle shift of the instruction fetch stage, that means that the branch target can be fetched in time to become the next but one, so the hardware runs the branch instruction, then the branch delay slot instruction, and then the branch target — with no other delays.

... [don't waste your branch-delay slots]

... [many MIPS assemblers will reorder instructions for you if it's safe, to hide branch delay]

See MIPS Run has a foreword by John L. Hennessy, Founder of MIPS Technologies etc. etc.. That's not proof he signed off on everything in the book being accurate, but it's good evidence that the book's description of how MIPS managed this trick is accurate.

It's easily understandable and 100% plausible; we already know the data cache has single-cycle fetch latency (after address-generation in the EX stage).

0
votes

You are actually asking two questions:

  1. Is that safe on MIPS I?
  2. If so, how?

Is that safe on MIPS I?

I have seen different block diagrams of MIPS CPUs. Most of them perform the branch decision in the EX or even in the MEM stage instead of the ID stage.

Of course such designs will react differently when your example code is executed.

Without an official statement from the CPU manual of the CPU you are really using, your question cannot be answered with certainty.

(Paul Clayton's answer on Is that true if we can always fill the delay slot there is no need for branch prediction? agrees that one delay slot does fully hide branch latency on MIPS R2000, but not MIPS R4000. So that's good evidence that real commercial MIPS CPUs work the way the question assumes, despite the existence of various implementations that might not exactly follow the MIPS ISA.)

If so, how?

Doesn't this mean that branches need their input ready a cycle earlier than ALU instructions?

No.

The key is the bypass forwarding logic. Let's take a look at the following example:

add  $A, $B, $C      ; Currently in MEM stage
or   $D, $E, $F      ; Currently in EX stage
bltz $G, someLabel   ; Currently in ID stage

(While A, B, ... G are GPR numbers.)

The bypass forwarding logic for the EX phase (or instruction) contains a multiplexer that works the following way (pseudo code):

if E = A
    take ALU input from EX/MEM shift register output
else
    take ALU input from ID/EX shift register output
end-if

It is this multiplexer which allows you to use the result of some instruction (add) in the following one (or).

Of course the same can be done for the ID phase using a 3-way multiplexer:

if G = D
    take branch decision input from ALU output
else if G = A
    take branch decision input from EX/MEM shift register output
else
    take branch decision input from register bank output
end-if

Doing this, the signal propagation time will increase by the time needed in the EX phase. This means that this will limit the clock frequency of the processor.

However, the result of some instruction can already be used in the ID stage of the next instruction without needing an additional clock cycle.