14
votes

In Volume 3 of the Intel Manuals it contains the description of a hardware event counter:

BACLEAR_FORCE_IQ

Counts number of times a BACLEAR was forced by the Instruction Queue. The IQ is also responsible for providing conditional branch prediction direction based on a static scheme and dynamic data provided by the L2 Branch Prediction Unit. If the conditional branch target is not found in the Target Array and the IQ predicts that the branch is taken, then the IQ will force the Branch Address Calculator to issue a BACLEAR. Each BACLEAR asserted by the BAC generates approximately an 8 cycle bubble in the instruction fetch pipeline.

I always thought the Branch Address Calculator performs the static prediction algorithm (when the Branch Target Buffer contains no branch entry)?

Can anybody confirm which of the above two are correct? I cannot find anything.

2
I deleted my answer since it wasn't helpful. But I noticed that the Intel Optimization reference manual says: "The Intel Core microarchitecture does not use the static prediction heuristic. However, to maintain consistency across Intel 64 and IA-32 processors, software should maintain the static prediction heuristic as the default."Gabriel Southern

2 Answers

3
votes

If the conditional branch target is not found in the Target Array

How can it not be found? you mask it with a bit mask to find the index into the table and get the next branch target.

Well if you after you read the result check that the call address does not match the tag on the result you have a "not taken" result.

At this point we get to the second part of the statement.

and the IQ predicts that the branch is taken

So branch target says "not taken" and the IQ predicts that it will be taken we have a contradiction.

To solve the contradiction the IQ wins as the branch target is just "if we jump, we jump here", but the IQ predicts if we jump or not based on a lot more logic.

Hence

then the IQ will force the Branch Address Calculator to issue a BACLEAR. Each BACLEAR asserted by the BAC generates approximately an 8 cycle bubble in the instruction fetch pipeline.

Which is good in a 14-19 stage pipeline. The 8 cycles is if the IQ can read the actual target address from the instruction (combined with PC), if the value needs to be read in a register (that is possible not yet retired) it could take a bit longer.

3
votes

Yes. Modern Intel processors use at least one static prediction technique and at least one dynamic prediction technique (such as the L2 BPU mentioned in the description of the performance event). Static prediction is discussed in the Intel optimization manual, but it does not clearly say where static prediction happens exactly. However, the description of multiple performance events related to branch prediction, such as BACLEAR_FORCE_IQ, indicate that it is implemented in the IQ unit. I think that this is the place where static branch prediction makes most sense.

The BPU first guesses where the branch instructions are most likely to be in the (to be) fetched instruction stream bytes (32 bytes per cycle in Haswell, twice the fetch unit width). Then, based, on the virtual instruction address(s) of the instruction(s) that are predicted to be control transfer instruction(s), the BPU consults its buffers (specifically, the "branch target buffer" or the "target array"), to make more predictions regarding the predicted branches (direction and target address). However, in some cases the BPU misses in its buffers or it might mispredict the location(s) of the branch instruction(s) in the instruction stream bytes or there could be more branches than the BPU could handle. Whatever the case is, whatever prediction is makes, they all get passed with the instruction stream bytes to the instruction queue unit. This is the earliest place in the pipeline where it is known where each instruction begins and ends and which of the instructions may transfer control.

The IQ is also responsible for providing conditional branch prediction direction based on a static scheme and dynamic data provided by the L2 Branch Prediction Unit.

This part of the event description should make sense to you now. Note that static branch prediction is mostly only used to predict directions, not target addresses.

If the conditional branch target is not found in the Target Array and the IQ predicts that the branch is taken...

The simple static branch predictor is only used when the BPU fails to make a prediction. So the first part of the condition makes sense. The second part, however, basically says that if the IQ predicts not taken, then nothing needs to be done. This indicates that the fetch unit will by default continue fetching code from the fall-through path on a BPU failure.

...then the IQ will force the Branch Address Calculator to issue a BACLEAR

So if the static predictor predicts taken, then it's better to do something about that. One intuitive thing is to flush everything above the IQ and tell the fetch unit to stop fetching bytes. That's what the BACLEAR signal does.This situation is called a frontend resteering. It'd be nice if we could tell the fetch unit where to fetch from as well, but we my not know the branch target address yet. Even if the address is embedded within the instruction (as an immediate operand), the IQ may not be to just extract it and forward to the fetch unit. We can just do nothing and wait until the address is calculated, thereby potentially saving power and energy. Or we can provide the BPU with the address (now that we know exactly where the branch instruction is) and let the BPU try again. Perhaps, the purpose of the "Branch Address Calculator", is to not only send the BACLEAR signal, but also try to determine the address as early as possible.

Each BACLEAR asserted by the BAC generates approximately an 8 cycle bubble in the instruction fetch pipeline.

It's not clear to me what the 8 cycle bubble accounts for. One possible interpretation is that the flushing that is caused by BACLEAR takes about 8 cycles, but the fetch unit might still be idle waiting for the address from which it should fetch. Determining the target address may take more than 8 cycles, depending on how it gets calculated and the surrounding code. Or it could mean that, on average, it take only about 8 cycles to fully resteer the front end and begin fetching from the target address. In addition, this 8 cycles penalty may not actually be on the critical path, so it may not impact the overall performance.

In summary, BACLEAR_FORCE_IQ occurs when a conditional branch (and only conditional branches) misses in the BPU (not any other BPU failure) and the IQ predicts taken.

I think that the BAC is used to handle any branch misprediction situation, not just by the IQ. Other performance events indicate that.