I'm trying to draw out the stalls on a fully bypassed MIPS processor. I'm a bit confused as to how it would work on a conditional branch like beq when it follows lw. I know that we cannot retrieve the value from lw until it has been read from memory (the M stage), but I also know that the branch needs to retrieve its registers for the conditional by the decode stage. Assuming the stages of the pipeline are F D E M W, which of these would be the correct forwarding path?

Diagram 1:

lw $t0, 0($a0)     F D E M W
                         |            # M-D bypass
beq $t0, $0, ret     F D D E M W      # mandatory stall from the lw

Diagram 2:

lw $t0, 0($a0)     F D E M W
                           |          # W-E bypass
beq $t0, $0, ret     F D D E M W      # mandatory stall from the lw

Diagram 3:

lw $t0, 0($a0)     F D E M W
                         \
                          \         # M-E bypass
beq $t0, $0, ret     F D D E M W      # mandatory stall from the lw
1 Answer

As per the Q&A "How does MIPS I handle branching on the previous ALU instruction without stalling?", conditional branches need their input(s) forwarded to the E stage.

So here it would be M->E forwarding, from the end of M to the start of E. Your 3rd diagram has a comment that says "M-E", but you've actually drawn forwarding from the end of E (or start of M?) to E.

lw $t0, 0($a0)     F D E M W 
                          \           # M-E bypass
beq $t0, $0, ret     F D D E M W      # mandatory stall from the lw

(I'm not sure if it would be more correct to show it stalling in E, like FDEEMW; I don't think so since Decode is responsible for figuring out whether to stall.)


In your offset diagram, where you show the stages shifted later in time, a vertical line would mean forwarding backwards in time. So 1 and 2 are impossible and can be ruled out. With only 1 stall cycle, you can't forward backwards the length of 3 stages (remember it's from the end of one stage to the start of another, so it's 3 stages counting both ends). Although to be fair, if write-back happens in the first half cycle, and register read happens in the 2nd half cycle, then it works.

Forwarding is always to E, whether it's from M or E. Decode is the stage that figures out what forwarding is needed and reads the register file to feed data to E. If forwarding is needed, you just forward straight to the place that needs it, not a stage earlier, to minimize latency / number of stall cycles.
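The stall count can be checked with a little arithmetic. Below is a minimal Python sketch, assuming one pipeline stage per clock cycle, a value that becomes available at the end of the producing stage, and a consumer that needs it at the start of the consuming stage. The names `stall_cycles` and `timeline` are hypothetical helpers made up for this illustration, not part of any simulator.

```python
STAGES = ["F", "D", "E", "M", "W"]

def stall_cycles(ready_after, needed_in, distance=1):
    """Stall cycles for a consumer issued `distance` instructions after its producer.

    ready_after: producer stage at whose END the value exists
                 ("E" for ALU results, "M" for loads).
    needed_in:   consumer stage at whose START the value must arrive.
    Assumes one stage per cycle and a bypass path between the two stages.
    """
    produced = STAGES.index(ready_after)           # cycle in which the producer is in that stage
    consumed = distance + STAGES.index(needed_in)  # cycle the consumer would enter its stage, unstalled
    # The value is usable one cycle after it is produced.
    return max(0, produced + 1 - consumed)

def timeline(stalls):
    """Consumer's stage sequence, drawing each stall as a repeated D."""
    return " ".join(["F"] + ["D"] * (stalls + 1) + ["E", "M", "W"])

print(stall_cycles("M", "E"))            # lw -> beq, M->E bypass: prints 1
print(timeline(stall_cycles("M", "E")))  # prints F D D E M W
print(stall_cycles("E", "E"))            # ALU op -> beq, E->E bypass: prints 0
```

Under the same assumptions, `stall_cycles("M", "D")` gives 2, which is what a design that resolved branches in D would pay after a load.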

(Forwarding to M would be possible if you wanted to do that for the store-data operand of a store; E only needs the store-address operand. I think I've seen forwarding to M mentioned in an earlier Q&A on SO so I won't dig deeper on that here.)


This of course assumes a MIPS with interlocked loads. Classic MIPS I (R2000) would not detect the RAW hazard and wouldn't stall, so beq would use the old value of $t0. The exception is a load that misses in cache: the pipeline then stalls until the data arrives, so beq would see the just-loaded value. In other words, classic MIPS I has a load delay slot; don't use a load result in the instruction right after a load.
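To make the load delay slot concrete, here is a toy Python model of the architectural behaviour only, with no pipeline timing. It assumes every load hits in cache, so it deliberately ignores the cache-miss case mentioned above. The function `run` and its tuple-based instruction format are invented for this sketch, and `("read", reg)` stands in for any instruction that uses the register, such as beq.

```python
def run(program, regs, memory, load_delay_slot):
    """Run a list of ('lw', dst, addr), ('read', src) or ('nop',) tuples.

    Returns the values seen by the 'read' operations.
    With load_delay_slot=True, a loaded value only becomes visible
    after the instruction that follows the load (MIPS I style, cache hit assumed).
    """
    regs = dict(regs)   # work on a copy
    pending = None      # (dst, value) from the immediately preceding load
    observed = []
    for op in program:
        previous_load, pending = pending, None
        kind = op[0]
        if kind == "lw":
            _, dst, addr = op
            if load_delay_slot:
                pending = (dst, memory[addr])   # not visible to the next instruction
            else:
                regs[dst] = memory[addr]        # interlocked: next instruction sees it
        elif kind == "read":
            observed.append(regs[op[1]])
        elif kind != "nop":
            raise ValueError(f"unknown operation: {kind}")
        # The previous load's result lands only after this instruction has read its operands.
        if previous_load is not None:
            dst, value = previous_load
            regs[dst] = value
    return observed

regs = {"$t0": 111}
memory = {0: 222}
back_to_back = [("lw", "$t0", 0), ("read", "$t0")]
padded = [("lw", "$t0", 0), ("nop",), ("read", "$t0")]

print(run(back_to_back, regs, memory, load_delay_slot=True))   # [111], the old value
print(run(back_to_back, regs, memory, load_delay_slot=False))  # [222]
print(run(padded, regs, memory, load_delay_slot=True))         # [222]
```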

Later MIPS revisions added load interlocks so software no longer had to pad with NOPs, saving I-cache footprint in cases where the compiler couldn't find anything useful to put in a load delay slot. Branch-delay slots are architecturally visible and couldn't be removed without breaking machine-code compatibility, so they took much longer to get rid of (MIPS32r6 / MIPS64r6 reorganized opcodes and introduced new branch instructions).