In the described pipeline the direction and target of a conditional branch is not available until the end of the third cycle, so the correct next instruction after the branch cannot be fetched (with certainty) until the beginning of the fourth cycle.
Design 1
An obvious way to handle the delayed availability of the address of the instruction after the branch is simply to wait. This is what the design 1 does by stalling for two cycles (which is equivalent to fetching two no-ops that are not part of the actual program). This means that for both taken and not taken paths two cycles will wasted, just as if two no-op instructions had been inserted by the compiler.
Here are diagrams of the pipeline (ST is a stall, NO is a no-op, XX is a canceled instruction, UU is a useless instruction, I1, I2, and I3 are the three instructions before the branch [in the original program order before filling any delay slots], BI is the branch instruction, I5, I6, and I7 are the fall-through instructions after the branch, I21, I22, and I23 are the instructions at the start of the taken path; IF is the instruction fetch stage, DE is decode, BR is branch resolve, S1 is the stage after BR):
Taken Not taken
IF DE BR S1 ... IF DE BR S1 ...
cycle 1 BI I3 I2 I1 BI I3 I2 I1
cycle 2 ST BI I3 I2 ST BI I3 I2
cycle 3 ST ST BI I3 ST ST BI I3
cycle 4 I21 ST ST BI I5 ST ST BI
cycle 5 I22 I21 ST ST I6 I5 ST ST
Design 2
To avoid having to detect the presence of a branch by the end of the IF stage and to allow some useful work to be done sometimes (in the not taken case), rather than having hardware effectively insert no-ops into the pipeline (i.e., stall fetch after the branch) the hardware can treat the branch as any other instruction until it is resolved in the third pipeline stage. This is predicting all branches as not taken. If the branch is taken, then the two instructions fetched after the branch are canceled (effectively turned into no-ops). This is the design 2:
Taken Not taken
IF DE BR S1 ... IF DE BR S1 ...
cycle 1 BI I3 I2 I1 BI I3 I2 I1
cycle 2 I5 BI I3 I2 I5 BI I3 I2
cycle 3 I6 I5 BI I3 I6 I5 BI I3
cycle 4 I21 XX XX BI I7 I6 I5 BI
cycle 5 I22 I21 XX XX I8 I7 I6 I5
Design 3
Always predicting a branch to be not taken will waste two cycles whenever a branch is taken, so a third mechanism was developed to avoid this waste--the delayed branch. In a delayed branch, the hardware always executes (does not cancel) the delay slot instructions after the branch (two instructions in the example). By always executing the delay slot instructions, the pipeline simplified. The compiler's job is to try to fill these delay slots with useful instructions.
Instructions taken from before the branch (in the program without delayed branches) will be useful regardless of which path is taken (but dependencies can prevent the compiler from scheduling any such instructions after the branch). The compiler can fill a delay slot with an instruction from the taken or not taken path, but such an instruction cannot be one that overwrites state used by the other path (or after the paths join) since delay slot instructions are not canceled (unlike with prediction). (If both paths join--as is common for if-then-else constructs--, then delay slots could potentially be filled from the join point; but such instructions are usually dependent on instructions from at least one of the paths before the join, which dependency would prevent them from being used in delay slots.) If the compiler cannot find a useful instruction, it must fill the delay slot with a no-op.
In case 3.1 (the worst case for a delayed branch design), the compiler could not find any useful instructions to fill the delay slots and so must fill them with no-ops:
Taken Not taken
IF DE BR S1 ... IF DE BR S1 ...
cycle 1 BI I3 I2 I1 BI I3 I2 I1
cycle 2 NO BI I3 I2 NO BI I3 I2
cycle 3 NO NO BI I3 NO NO BI I3
cycle 4 I21 NO NO BI I5 NO NO BI
cycle 5 I22 I21 NO NO I6 I5 NO NO
This is equivalent in performance to design 1 (stall two cycles).
In case 3.2 (the best case for a delayed branch design), the compiler found two instructions from before the branch to fill the delay slots:
Taken Not taken
IF DE BR S1 ... IF DE BR S1 ...
cycle 1 BI I1 ... BI I1 ...
cycle 2 I2 BI I1 ... I2 BI I1 ...
cycle 3 I3 I2 BI I1 I3 I2 BI I1
cycle 4 I21 I3 I2 BI I5 I3 I2 BI
cycle 5 I22 I21 I3 I2 I6 I5 I3 I2
In this case, all pipeline slots are filled with useful instructions regardless of whether the branch is taken or not taken. The performance (CPI) is the same as for an ideal pipeline without delayed resolution of branches.
In case 3.3, the compiler filled the delay slots with instructions from the taken path:
Taken Not taken
IF DE BR S1 ... IF DE BR S1 ...
cycle 1 BI I3 I2 I1 BI I3 I2 I1
cycle 2 I21 BI I3 I2 I21 BI I3 I2
cycle 3 I22 I21 BI I3 I22 I21 BI I3
cycle 4 I23 I22 I21 BI I5 UU UU BI
cycle 5 I24 I23 I22 I21 I6 I5 UU UU
In the not taken path I21 and I22 are useless. Although they are actually executed (and update state), this state is not used in the not taken path (or after any joining of the paths). For the not taken path, it is as if the delay slots had been filled with no-ops.
In case 3.4, the compiler could only find one safe instruction from the not taken path and must fill the other delay slot with a no-op:
Taken Not taken
IF DE BR S1 ... IF DE BR S1 ...
cycle 1 BI I3 I2 I1 BI I3 I2 I1
cycle 2 I5 BI I3 I2 I5 BI I3 I2
cycle 3 NO I5 BI I3 NO I5 BI I3
cycle 4 I21 NO UU BI I6 NO I5 BI
cycle 5 I22 I21 NO UU I7 I6 NO I5
For the taken path, one useless instruction and one no-op are executed, wasting two cycles. For the not taken path, one no-op is executed, wasting one cycle.
Calculating CPI
The formula for calculating CPI in this case is:
%non_branch * CPI_non_branch + %branch * CPI_branch
CPI_branch is calculated by accounting for the time taken for the branch itself (baseCPI_branch) and the percentage of times the branch is taken with the wasted cycles when it is taken and the percentage of times the branch is not taken with the wasted cycles when it is not taken. So the CPI_branch is:
baseCPI_branch + (%taken * wasted_cycles_taken) +
(%not_taken * wasted_cycles_not_taken)
In an ideal scalar pipeline, each instruction takes one cycle, i.e., the Cycles Per Instruction is 1. In this example, non-branch instructions behave as if the pipeline were ideal ("all stalls in the processor are branch-related"), so each non-branch instruction has a CPI of 1. Likewise, the baseCPI_branch (excluding wasted cycles from stalls, no-ops, et al.) is 1.
Based on the pipeline diagrams above, one can determine the number of cycles that are wasted in the taken and in the not taken paths. The example gives the percentage of branches and the percentages of branches that are taken and not taken.
For the design 1, both taken and not taken paths waste 2 cycles, so the CPI_branch is:
1 + (0.3 * 2) + (0.7 *2) = 3
and the total CPI is therefore:
(0.85 * 1) + (0.15 * 3) = 1.3