4
votes

For a MIPS 5-stage pipeline, the branch target is known by the decode stage: the branch offset is encoded in the instruction itself, and the registers are read in decode, so the target can be extracted easily.

So then for an out-of-order pipeline, you obviously run into a problem with instructions like 'jr', which might use a register that hasn't been computed yet. For cases like this, there is a clear use for a branch target buffer.

But for an instruction like 'beq', I see the necessity for a branch predictor, but not for storing the branch target: you already know the branch offset, and of course you know the current program counter, so you can easily compute the branch destination.
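That computation is indeed simple. A minimal sketch of the MIPS beq target calculation (target = address of the delay slot, PC + 4, plus the sign-extended 16-bit offset shifted left 2):

```python
def beq_target(pc, offset16):
    """Compute the MIPS beq destination: PC + 4 plus the sign-extended
    16-bit immediate shifted left 2 (targets are word-aligned)."""
    # Sign-extend the 16-bit immediate
    if offset16 & 0x8000:
        offset16 -= 0x10000
    return (pc + 4) + (offset16 << 2)

# A branch at 0x1000 with offset 3 targets 0x1010
print(hex(beq_target(0x1000, 3)))
```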

Are register jumps the only instructions that use the branch target buffer or am I missing something?

Much of the benefit of a Branch Target Buffer comes from the fact that it allows you to predict the existence and target of a branch before even fetching the branch instruction, let alone decoding it. – Chris Dodd
Why does that present a benefit? If you know a branch is coming, how does that allow you to improve the performance of your pipeline? Is it only used for prefetching purposes? – Chris
@Chris Immediately after you've fetched the BEQ instruction you want to fetch the next instruction. You don't want to wait until the instruction has been decoded and its destination calculated. – Ross Ridge
The point of the BTB, and branch prediction in general, is to avoid stalling the pipeline regardless of whether there's a cache or not. – Ross Ridge
See What branch misprediction does the Branch Target Buffer detect?, which explains that the fetch stage needs prediction to know what block to fetch next. See also Slow jmp-instruction for an x86 benchmark of a giant sequence of jumps to the next instruction (i.e. relative offset = 0). It slows down when the sequence is long enough not to fit in the BTB. – Peter Cordes

1 Answer

3
votes

The fetch stage needs prediction to know what block to fetch next. Instruction cache has some latency but can be pipelined. DRAM has even more latency, but can still have multiple outstanding requests (depending on the memory controller or outer levels of cache). So the fetch stage needs block addresses multiple cycles ahead of the blocks that are currently arriving from memory/cache.
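The core idea can be sketched as a toy direct-mapped BTB (illustrative only, not any real design): fetch looks up its own PC and gets a predicted next-fetch address without waiting for decode.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB: maps a fetch address to a predicted
    next-fetch address, so fetch can redirect before decode."""
    def __init__(self, entries=16):
        self.entries = entries
        self.table = {}  # index -> (tag, target)

    def predict(self, pc):
        # Looked up in the fetch stage, in parallel with I-cache access.
        idx = (pc >> 2) % self.entries
        entry = self.table.get(idx)
        if entry and entry[0] == pc:
            return entry[1]      # hit: fetch from the predicted target
        return pc + 4            # miss: fall through sequentially

    def update(self, pc, target):
        # Filled in when a taken branch resolves later in the pipeline.
        self.table[(pc >> 2) % self.entries] = (pc, target)

btb = BranchTargetBuffer()
btb.update(0x1000, 0x2000)           # a taken branch trains the BTB
print(hex(btb.predict(0x1000)))      # 0x2000: redirect in fetch
print(hex(btb.predict(0x1004)))      # 0x1008: sequential on a miss
```

Real BTBs also index by larger fetch-block addresses and store partial tags, but the lookup-before-decode timing is the point here.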

Decode doesn't happen until after fetch, so that's an extra stall cycle if you wait until decode to detect the existence of even unconditional direct branches.
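A rough illustration of that cost (made-up counts, toy model assuming the front end refetches from the target as soon as it is known):

```python
def redirect_bubbles(taken_branches, cycles_until_target_known):
    """Fetch bubbles injected by taken branches, assuming one wasted
    fetch cycle per cycle of delay before the target is known."""
    return taken_branches * cycles_until_target_known

# 1000 taken branches (illustrative):
print(redirect_bubbles(1000, 0))  # BTB hit in fetch: 0 bubbles
print(redirect_bubbles(1000, 1))  # target found only in decode: 1000 bubbles
```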

See What branch misprediction does the Branch Target Buffer detect? for more about this from an x86 perspective (where decode is expensive and takes multiple stages, so this is even more critical).

Also note that high-performance CPUs decode multiple instructions in parallel, and often have a queue between fetch and decode to absorb fetch bubbles. If the fetch stage predicts there's a taken branch (conditional or unconditional, doesn't matter), it can queue up instructions from the branch target instead of instructions after the branch.
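A toy model of such a queue (illustrative widths, not any real machine): fetch can burst up to 8 instructions per cycle while decode consumes 4, so the queue hides fetch-bubble cycles.

```python
from collections import deque

queue = deque()
decoded = 0
# Per-cycle fetch counts; 0 models a fetch bubble (I-cache miss or redirect)
fetch_pattern = [8, 0, 8, 0]

for fetched in fetch_pattern:
    queue.extend([0] * fetched)          # fetch side fills the queue
    for _ in range(min(4, len(queue))):  # decode drains up to 4 per cycle
        queue.popleft()
        decoded += 1

print(decoded)  # 16: decode never starved despite two fetch-bubble cycles
```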


See also Slow jmp-instruction for an x86 benchmark of a giant sequence of jumps to the next instruction (i.e. relative offset = 0). It slows down when the sequence is long enough not to fit in the BTB.