Yes. Modern Intel processors use at least one static prediction technique and at least one dynamic prediction technique (such as the L2 BPU mentioned in the description of the performance event). Static prediction is discussed in the Intel optimization manual, but it does not clearly say where static prediction happens exactly. However, the description of multiple performance events related to branch prediction, such as BACLEAR_FORCE_IQ
, indicate that it is implemented in the IQ unit. I think that this is the place where static branch prediction makes most sense.
The BPU first guesses where the branch instructions are most likely to be in the (to be) fetched instruction stream bytes (32 bytes per cycle in Haswell, twice the fetch unit width). Then, based, on the virtual instruction address(s) of the instruction(s) that are predicted to be control transfer instruction(s), the BPU consults its buffers (specifically, the "branch target buffer" or the "target array"), to make more predictions regarding the predicted branches (direction and target address). However, in some cases the BPU misses in its buffers or it might mispredict the location(s) of the branch instruction(s) in the instruction stream bytes or there could be more branches than the BPU could handle. Whatever the case is, whatever prediction is makes, they all get passed with the instruction stream bytes to the instruction queue unit. This is the earliest place in the pipeline where it is known where each instruction begins and ends and which of the instructions may transfer control.
The IQ is also responsible for providing conditional branch prediction
direction based on a static scheme and dynamic data provided by the L2
Branch Prediction Unit.
This part of the event description should make sense to you now. Note that static branch prediction is mostly only used to predict directions, not target addresses.
If the conditional branch target is not found in the Target Array and
the IQ predicts that the branch is taken...
The simple static branch predictor is only used when the BPU fails to make a prediction. So the first part of the condition makes sense. The second part, however, basically says that if the IQ predicts not taken, then nothing needs to be done. This indicates that the fetch unit will by default continue fetching code from the fall-through path on a BPU failure.
...then the IQ will force the Branch Address Calculator to issue a
BACLEAR
So if the static predictor predicts taken, then it's better to do something about that. One intuitive thing is to flush everything above the IQ and tell the fetch unit to stop fetching bytes. That's what the BACLEAR signal does.This situation is called a frontend resteering. It'd be nice if we could tell the fetch unit where to fetch from as well, but we my not know the branch target address yet. Even if the address is embedded within the instruction (as an immediate operand), the IQ may not be to just extract it and forward to the fetch unit. We can just do nothing and wait until the address is calculated, thereby potentially saving power and energy. Or we can provide the BPU with the address (now that we know exactly where the branch instruction is) and let the BPU try again. Perhaps, the purpose of the "Branch Address Calculator", is to not only send the BACLEAR signal, but also try to determine the address as early as possible.
Each BACLEAR asserted by the BAC generates approximately an 8 cycle
bubble in the instruction fetch pipeline.
It's not clear to me what the 8 cycle bubble accounts for. One possible interpretation is that the flushing that is caused by BACLEAR takes about 8 cycles, but the fetch unit might still be idle waiting for the address from which it should fetch. Determining the target address may take more than 8 cycles, depending on how it gets calculated and the surrounding code. Or it could mean that, on average, it take only about 8 cycles to fully resteer the front end and begin fetching from the target address. In addition, this 8 cycles penalty may not actually be on the critical path, so it may not impact the overall performance.
In summary, BACLEAR_FORCE_IQ
occurs when a conditional branch (and only conditional branches) misses in the BPU (not any other BPU failure) and the IQ predicts taken.
I think that the BAC is used to handle any branch misprediction situation, not just by the IQ. Other performance events indicate that.