Understanding stalls and branch delay slots

Question

I am taking a course on Computer Architecture. I found this website from another University which has notes and videos which are helping me thus far: CS6810, Univ of Utah. I am working through some old homework assignments posted on that site, in particular this one. I am trying to understand pipelining and related concepts, specifically stalls and branch delay slots.

I am looking now at the first question from that old homework assignment and am unsure of how to do these problems.

The question is as follows:

Consider the following code segment, where the branch is taken 30% of the time and not taken 70% of the time.

R1 = R2 + R3

R4 = R5 + R6

R7 = R8 + R9

if R10 = 0, branch to linex

R11 = R12 + R13

R14 = R11 + R15

R16 = R14 + R17

...

linex: R18 = R19 + R20

R21 = R18 + R22

R23 = R18 + R21

...

Consider a 10-stage in-order processor, where the instruction is fetched in the first stage, and the branch outcome is known after three stages. Estimate the CPI of the processor under the following scenarios (assume that all stalls in the processor are branch-related and branches account for 15% of all executed instructions):

1. On every branch, fetch is stalled until the branch outcome is known.

2. Every branch is predicted not-taken and the mis-fetched instructions are squashed if the branch is taken.

3. The processor has two delay slots and the two instructions following the branch are always fetched and executed, and

3.1. You are unable to find any instructions to fill the delay slot.

3.2. You are able to move two instructions before the branch into the delay slot.

3.3. You are able to move two instructions after label "linex" into the delay slot.

3.4. You are able to move one (note: one, not two!) instruction immediately after the branch (in the original code) into the delay slot.

I am unsure of how to even begin to look at this question. I have read all the notes and watched the videos on that site and have read sections from the H&P book but am still confused on this problem. If anyone has the time, I would appreciate someone helping me step through this question. I just need to know how to begin to conceptualize the answers.

Paul A. Clayton · Accepted Answer · 2013-10-02T20:33:23

In the described pipeline, the direction and target of a conditional branch are not available until the end of the third cycle, so the correct next instruction after the branch cannot be fetched (with certainty) until the beginning of the fourth cycle.

Design 1

An obvious way to handle the delayed availability of the address of the instruction after the branch is simply to wait. This is what design 1 does by stalling for two cycles (which is equivalent to fetching two no-ops that are not part of the actual program). This means that for both the taken and not taken paths two cycles will be wasted, just as if two no-op instructions had been inserted by the compiler.

Here are diagrams of the pipeline (ST is a stall, NO is a no-op, XX is a canceled instruction, UU is a useless instruction, I1, I2, and I3 are the three instructions before the branch [in the original program order before filling any delay slots], BI is the branch instruction, I5, I6, and I7 are the fall-through instructions after the branch, I21, I22, and I23 are the instructions at the start of the taken path; IF is the instruction fetch stage, DE is decode, BR is branch resolve, S1 is the stage after BR):

         Taken                  Not taken
         IF  DE  BR  S1 ...     IF  DE  BR  S1 ...
cycle 1  BI  I3  I2  I1         BI  I3  I2  I1
cycle 2  ST  BI  I3  I2         ST  BI  I3  I2
cycle 3  ST  ST  BI  I3         ST  ST  BI  I3
cycle 4  I21 ST  ST  BI         I5  ST  ST  BI
cycle 5  I22 I21 ST  ST         I6  I5  ST  ST
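A diagram like the one above can be generated mechanically from the order in which slots enter the fetch stage. The following is a minimal Python sketch, not part of the original assignment: the function name `pipeline_rows`, the list-of-labels representation, and the `".."` placeholder are assumptions made for illustration. Each stage in a given cycle simply holds whatever was fetched one cycle earlier than the stage to its left.

```python
def pipeline_rows(fetch_order, first="BI", stages=4, cycles=5):
    """Return one row per cycle listing the contents of IF, DE, BR, S1.

    fetch_order: labels in the order they enter IF (stalls included as "ST").
    first: label that sits in IF during the first cycle shown.
    """
    start = fetch_order.index(first)
    rows = []
    for c in range(cycles):
        row = []
        for s in range(stages):
            i = start + c - s  # stage s holds what was fetched s cycles ago
            row.append(fetch_order[i] if 0 <= i < len(fetch_order) else "..")
        rows.append(row)
    return rows


# Design 1, taken path: two stall slots follow the branch.
design1_taken = ["I1", "I2", "I3", "BI", "ST", "ST", "I21", "I22"]

for n, row in enumerate(pipeline_rows(design1_taken), start=1):
    print(f"cycle {n}  " + " ".join(f"{x:<3}" for x in row))

# Slots that do no useful work between the branch and the next real instruction.
wasted = sum(1 for x in design1_taken if x in ("ST", "NO", "XX", "UU"))
```

Counting the `ST`, `NO`, `XX`, and `UU` entries in the fetch order gives the number of wasted cycles for that path, which is the quantity the CPI calculation needs.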

Design 2

To avoid having to detect the presence of a branch by the end of the IF stage and to allow some useful work to be done sometimes (in the not taken case), the hardware can treat the branch as any other instruction until it is resolved in the third pipeline stage, rather than effectively inserting no-ops into the pipeline (i.e., stalling fetch after the branch). This amounts to predicting all branches as not taken. If the branch is taken, then the two instructions fetched after the branch are canceled (effectively turned into no-ops). This is design 2:

         Taken                  Not taken
         IF  DE  BR  S1 ...     IF  DE  BR  S1 ...
cycle 1  BI  I3  I2  I1         BI  I3  I2  I1
cycle 2  I5  BI  I3  I2         I5  BI  I3  I2
cycle 3  I6  I5  BI  I3         I6  I5  BI  I3
cycle 4  I21 XX  XX  BI         I7  I6  I5  BI
cycle 5  I22 I21 XX  XX         I8  I7  I6  I5

Design 3

Always predicting a branch to be not taken will waste two cycles whenever a branch is taken, so a third mechanism was developed to avoid this waste: the delayed branch. In a delayed branch, the hardware always executes (does not cancel) the delay slot instructions after the branch (two instructions in the example). By always executing the delay slot instructions, the pipeline is simplified. The compiler's job is to try to fill these delay slots with useful instructions.

Instructions taken from before the branch (in the program without delayed branches) will be useful regardless of which path is taken (but dependencies can prevent the compiler from scheduling any such instructions after the branch). The compiler can fill a delay slot with an instruction from the taken or not taken path, but such an instruction cannot be one that overwrites state used by the other path (or after the paths join), since delay slot instructions are not canceled (unlike with prediction). (If both paths join, as is common for if-then-else constructs, then delay slots could potentially be filled from the join point; but such instructions are usually dependent on instructions from at least one of the paths before the join, and that dependency would prevent them from being used in delay slots.) If the compiler cannot find a useful instruction, it must fill the delay slot with a no-op.
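The "must not overwrite state used by the other path" rule can be sketched as a simple register check. This is an illustrative simplification, not what a real compiler does in full: the function name `is_safe_for_delay_slot` and the `(dest, sources)` tuple encoding are assumptions, only registers are considered (no memory or exceptions), and only the instructions visible in the question are examined, so anything hidden behind the "..." is ignored.

```python
def is_safe_for_delay_slot(candidate, other_path):
    """Check whether `candidate` may run even when `other_path` is executed.

    Instructions are (dest, sources) tuples. The candidate is unsafe if the
    other path reads the candidate's destination before overwriting it.
    """
    dest, _ = candidate
    for other_dest, sources in other_path:
        if dest in sources:
            return False  # other path would see the clobbered value
        if other_dest == dest:
            return True   # other path overwrites it first, so no harm done
    return True           # never read in the visible part of the other path


# Taken path from the question (linex onward).
taken_path = [
    ("R18", ("R19", "R20")),
    ("R21", ("R18", "R22")),
    ("R23", ("R18", "R21")),
]

i5 = ("R11", ("R12", "R13"))          # first fall-through instruction
hypothetical = ("R19", ("R1", "R2"))  # made-up instruction that clobbers R19

print(is_safe_for_delay_slot(i5, taken_path))
print(is_safe_for_delay_slot(hypothetical, taken_path))
```

Under these assumptions the first fall-through instruction passes the check because the visible taken path never reads R11, while the made-up instruction fails because the taken path reads R19.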

In case 3.1 (the worst case for a delayed branch design), the compiler could not find any useful instructions to fill the delay slots and so must fill them with no-ops:

         Taken                  Not taken
         IF  DE  BR  S1 ...     IF  DE  BR  S1 ...
cycle 1  BI  I3  I2  I1         BI  I3  I2  I1
cycle 2  NO  BI  I3  I2         NO  BI  I3  I2
cycle 3  NO  NO  BI  I3         NO  NO  BI  I3
cycle 4  I21 NO  NO  BI         I5  NO  NO  BI
cycle 5  I22 I21 NO  NO         I6  I5  NO  NO

This is equivalent in performance to design 1 (stall two cycles).

In case 3.2 (the best case for a delayed branch design), the compiler found two instructions from before the branch to fill the delay slots:

         Taken                  Not taken
         IF  DE  BR  S1 ...     IF  DE  BR  S1 ...
cycle 1  BI  I1  ...            BI  I1  ...
cycle 2  I2  BI  I1  ...        I2  BI  I1  ...
cycle 3  I3  I2  BI  I1         I3  I2  BI  I1
cycle 4  I21 I3  I2  BI         I5  I3  I2  BI
cycle 5  I22 I21 I3  I2         I6  I5  I3  I2

In this case, all pipeline slots are filled with useful instructions regardless of whether the branch is taken or not taken. The performance (CPI) is the same as for an ideal pipeline without delayed resolution of branches.

In case 3.3, the compiler filled the delay slots with instructions from the taken path:

         Taken                  Not taken
         IF  DE  BR  S1 ...     IF  DE  BR  S1 ...
cycle 1  BI  I3  I2  I1         BI  I3  I2  I1
cycle 2  I21 BI  I3  I2         I21 BI  I3  I2
cycle 3  I22 I21 BI  I3         I22 I21 BI  I3
cycle 4  I23 I22 I21 BI         I5  UU  UU  BI
cycle 5  I24 I23 I22 I21        I6  I5  UU  UU

In the not taken path I21 and I22 are useless. Although they are actually executed (and update state), this state is not used in the not taken path (or after any joining of the paths). For the not taken path, it is as if the delay slots had been filled with no-ops.

In case 3.4, the compiler could only find one safe instruction from the not taken path and must fill the other delay slot with a no-op:

         Taken                  Not taken
         IF  DE  BR  S1 ...     IF  DE  BR  S1 ...
cycle 1  BI  I3  I2  I1         BI  I3  I2  I1
cycle 2  I5  BI  I3  I2         I5  BI  I3  I2
cycle 3  NO  I5  BI  I3         NO  I5  BI  I3
cycle 4  I21 NO  UU  BI         I6  NO  I5  BI
cycle 5  I22 I21 NO  UU         I7  I6  NO  I5

For the taken path, one useless instruction and one no-op are executed, wasting two cycles. For the not taken path, one no-op is executed, wasting one cycle.

Calculating CPI

The formula for calculating CPI in this case is:

%non_branch * CPI_non_branch + %branch * CPI_branch

CPI_branch accounts for the time taken by the branch itself (baseCPI_branch) plus the expected number of wasted cycles: the fraction of branches that are taken times the cycles wasted when taken, plus the fraction that are not taken times the cycles wasted when not taken. So CPI_branch is:

baseCPI_branch + (%taken * wasted_cycles_taken) + 
                 (%not_taken * wasted_cycles_not_taken)

In an ideal scalar pipeline, each instruction takes one cycle, i.e., the Cycles Per Instruction is 1. In this example, non-branch instructions behave as if the pipeline were ideal ("all stalls in the processor are branch-related"), so each non-branch instruction has a CPI of 1. Likewise, the baseCPI_branch (excluding wasted cycles from stalls, no-ops, etc.) is 1.

Based on the pipeline diagrams above, one can determine the number of cycles that are wasted in the taken and in the not taken paths. The example gives the percentage of branches and the percentages of branches that are taken and not taken.

For design 1, both the taken and not taken paths waste 2 cycles, so CPI_branch is:

1 + (0.3 * 2) + (0.7 * 2) = 3

and the total CPI is therefore:

(0.85 * 1) + (0.15 * 3) = 1.3
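The same arithmetic can be applied to the remaining scenarios. The sketch below is a hedged illustration rather than an official solution: the function names and the `SCENARIOS` table are assumptions, and the wasted-cycle counts are simply read off the pipeline diagrams above, counting stalls, no-ops, canceled, and useless instructions as wasted cycles.

```python
BRANCH_FRACTION = 0.15  # branches are 15% of executed instructions
TAKEN_FRACTION = 0.30   # branch is taken 30% of the time


def branch_cpi(wasted_taken, wasted_not_taken, base=1.0):
    """baseCPI_branch plus the expected wasted cycles per branch."""
    return (base
            + TAKEN_FRACTION * wasted_taken
            + (1 - TAKEN_FRACTION) * wasted_not_taken)


def total_cpi(wasted_taken, wasted_not_taken):
    """Weighted average of non-branch CPI (1) and CPI_branch."""
    return ((1 - BRANCH_FRACTION) * 1.0
            + BRANCH_FRACTION * branch_cpi(wasted_taken, wasted_not_taken))


# (wasted cycles if taken, wasted cycles if not taken), read from the diagrams.
SCENARIOS = {
    "1":   (2, 2),  # stall until the outcome is known
    "2":   (2, 0),  # predict not taken, squash on taken
    "3.1": (2, 2),  # both delay slots hold no-ops
    "3.2": (0, 0),  # both slots filled from before the branch
    "3.3": (0, 2),  # both slots filled from the taken path
    "3.4": (2, 1),  # one slot from the fall-through path, one no-op
}

for name, (taken, not_taken) in SCENARIOS.items():
    print(f"{name}: CPI_branch = {branch_cpi(taken, not_taken):.2f}, "
          f"CPI = {total_cpi(taken, not_taken):.3f}")
```

With these counts the formula gives a CPI of 1.3 for designs 1 and 3.1, 1.09 for design 2, 1.0 for case 3.2, 1.21 for case 3.3, and 1.195 for case 3.4.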