The thing you have to consider is the branch delay slot.
First, let's handle the case where they're off. This is the default for simulators like spim and mars. Things are simple:
5000: jalr $10 # (1) $31 will have 5004
5004: nop # (2) this executed upon return
This is the way most architectures work.
But MIPS has [the aforementioned] branch delay slots.
If delay slots are enabled [in a simulator], or you're on real hardware, every transfer of control instruction (e.g. branch, jump, jal, jalr) is followed by a single instruction that sits in the delay slot. That instruction is executed unconditionally, before the branch actually takes effect [whether it is taken or not]:
5000: jalr $10 # (1) $31 will have 5008
5004: nop # (2) this executed _before_ branch taken
5008: nop # (3) this executed upon return
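So, the effective execution order is actually (2), (1), (3). The return address in $31 is 5008 rather than 5004 because the return has to skip over the delay slot instruction, which has already been executed by then.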
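To make the difference concrete, here is a minimal sketch of a toy interpreter in Python. It is not how spim or mars are implemented; it just uses the common pc/next-pc trick to model the delay slot. The callee address (6000), the `run` function, and the tiny two-opcode instruction set are all assumptions made up for the illustration. The trace records the order in which instructions are issued, so the jalr shows up first, but the jump itself only takes effect after the slot instruction has run.

```python
def run(delay_slots, start=5000, max_steps=16):
    """Trace a toy jalr/jr program; returns (addresses in issue order, value of $31)."""
    # Hypothetical layout: caller at 5000, callee at 6000 (address held in $10).
    program = {
        5000: ("jalr", "$10"),
        5004: ("nop",),
        5008: ("nop",),
        6000: ("jr", "$31"),   # callee returns immediately
        6004: ("nop",),        # delay slot of the jr (only reached when slots are on)
    }
    regs = {"$10": 6000, "$31": 0}
    pc, npc = start, start + 4          # current and next instruction address
    issued = []
    for _ in range(max_steps):
        if pc not in program:
            break
        op, *args = program[pc]
        issued.append(pc)
        target = None
        if op == "jalr":
            # The link address skips the delay slot when slots are enabled.
            regs["$31"] = pc + (8 if delay_slots else 4)
            target = regs[args[0]]
        elif op == "jr":
            target = regs[args[0]]
        if target is None:
            pc, npc = npc, npc + 4      # ordinary sequential step
        elif delay_slots:
            pc, npc = npc, target       # run the slot instruction first, then jump
        else:
            pc, npc = target, target + 4  # jump takes effect immediately
    return issued, regs["$31"]
```

With slots on, `run(True)` visits 5000, 5004, 6000, 6004, 5008 and leaves 5008 in $31. With slots off, `run(False)` visits 5000, 6000, 5004, 5008 and leaves 5004 in $31.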
In the general case, you have a three step sequence:
5000: beqz $10,foobar # (1) conditional branch to foobar
5004: nop # (2) executed _before_ branch taken
5008: nop # (3) executed _after_ if branch _not_ taken
Once again, the effective execution order will be (2), (1). Then, either the first instruction of foobar is executed [if the branch was taken] or the instruction at 5008 (3) will be executed if the branch is not taken.
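The two outcomes of the conditional case can be sketched in a few lines of Python. This is only an illustration: the function name `beqz_sequence` and the address 7000 for foobar are assumptions, since the listing does not say where foobar lives.

```python
def beqz_sequence(reg_value, branch_pc=5000, target=7000):
    """Addresses issued for `beqz $10,foobar` on a machine with delay slots.

    `target` stands in for foobar; 7000 is a made-up address.
    """
    taken = (reg_value == 0)                 # beqz branches when the register is zero
    sequence = [branch_pc, branch_pc + 4]    # the branch, then its delay slot (always)
    # Only after the delay slot does control go to foobar or fall through.
    sequence.append(target if taken else branch_pc + 8)
    return sequence
```

For example, `beqz_sequence(0)` gives 5000, 5004, 7000 (branch taken), while `beqz_sequence(3)` gives 5000, 5004, 5008 (fall through). Either way, 5004 is executed.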
Okay, you may be asking: why?
In early MIPS chips, instructions were prefetched. For example, the instruction for cycle N+1 was prefetched [and possibly predecoded] in cycle N (a one cycle delay).
So, on cycle N, the instruction execution unit is executing the instruction fetched in cycle N-1 (e.g. 5000) while the instruction prefetch unit is fetching the next instruction (at 5004). The two overlap, with the one cycle delay. In cycle N+1, the execution unit is executing the prefetched instruction (at 5004) and the prefetch unit is prefetching the next instruction (at 5008).
This works great until a conditional transfer of control instruction is encountered.
Without the delay slot, the processor would have to stall: the instruction after the branch, which was prefetched during the same cycle in which the branch was executing, would have to be thrown away whenever the branch is taken. With delay slot execution, you can usually populate the slot with something useful, so the prefetch needn't be wasted.
But it does make things a bit more complicated.
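As rough arithmetic, assuming an idealized pipeline that issues one instruction per cycle and loses exactly one cycle whenever a prefetched instruction is discarded, the saving looks like this. The function `cycles` and its mode names are hypothetical, and real chips have other costs that this ignores.

```python
def cycles(useful, taken_branches, mode):
    """Cycle count in a toy one-instruction-per-cycle model.

    useful         -- number of non-branch instructions doing real work
    taken_branches -- number of taken branches/jumps
    mode           -- "stall":  no delay slot, the prefetched instruction is discarded
                      "nop":    delay slot exists but holds a nop
                      "filled": delay slot holds one of the useful instructions
    """
    base = useful + taken_branches
    if mode in ("stall", "nop"):
        # One cycle per taken branch does no useful work.
        return base + taken_branches
    if mode == "filled":
        # The slot does work that had to be done anyway, so nothing is lost.
        return base
    raise ValueError(f"unknown mode: {mode}")
```

For a loop with 30 useful instructions and 10 taken branches, this model gives 50 cycles when the slot is wasted and 40 when every slot is filled. A slot that holds a nop is no better than a stall; the win comes only from finding a useful instruction to put there.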