Why did Intel change the static branch prediction mechanism over these years?

Question

From here I know Intel implemented several static branch prediction mechanisms these years:

80486 age: Always-not-taken
Pentium4 age: Backwards Taken/Forwards Not-Taken
Newer CPUs like Ivy Bridge, Haswell have become increasingly intangible, see Matt G's experiment here.

And Intel seems don't want to talk about it any more, because the latest material I found within Intel Document was written about ten years ago.

I know static branch prediction is (far?) less important than dynamic, but in quite a few situations, CPU will be completely lost and programmers(with compiler) are usually the best guide. Of course these situations are usually not performance bottleneck, because once a branch is frequently executed, the dynamic predictor will capture it.

Since Intel no longer clearly statements the dynamic prediction mechanism in its document, the builtin_expect() of GCC can do nothing more than removing the unlikely branch from hot path.

I am not familiar with CPU design and I don't know what exactly mechanism Intel use nowadays for its static predictor, but I still feel the best mechanism for Intel should be to clearly document his CPU 'where I plan to go when dynamic predictor failed, forward or backward', because usually the programmer is the best guide at that time.

Update:
I found the topics you mentioned gradually go beyond my knowledge. Some dynamic prediction mechanism and CPU internal details are involved here which I can't learn within two or three days. So allow me quit your discussion temporarily and recharge.
Any answer is still welcome here, maybe will help more people

BeeOnRope BeeOnRope · Accepted Answer · 2018-08-14T19:19:14

The primary reason why static prediction is not favored in modern designs, to the point of perhaps not even being present, is that static predictions occur too late in the pipeline compared to dynamic predictions. The basic issue is that branch directions and target locations must be known before fetching them, but static predictions can only be made after decode (which comes after fetch).

In more detail...

CPU Pipelining

Briefly, during execution needs to fetch instructions from memory, decode those instructions and then execute them¹. On a high-performance CPU, these stages will be pipelined, meaning that they will all generally be happening in parallel - but for different instructions at any given moment. You could read a bit about this on Wikipedia, but keep in mind that modern CPUs are more complex, generally with many more stages.

On a modern x86, with a complex-to-decode variable-length instruction set, there may be many pipeline "stages" involved simply in fetching and decoding instructions, perhaps a half-dozen or more. Such instructions are also superscalar, capable of executing several instructions at once. This implies that when executing at peak efficiency, there will be many instructions in flight, in various stages of being fetched, decoded, executed and so on.

Redirecting Fetch

The effect of a taken branch is felt on the entire initial portion (usually called the front-end) of the pipeline: when you jump to a new address, you need to fetch from that new address, decode from that new address, etc. We say that a taken branch needs to redirect fetch. This puts certain restrictions on the information that branch prediction can use in order to perform efficiently.

Consider how static prediction works: it looks at the instruction and if it is a branch, compares its target to see if it is "forwards" or "backwards". All this must happen largely after decoding has occurred, since that's when the actual instruction is known. However, if a branch is detected and is predicted taken (e.g., a backwards jump), the predictor needs to redirect fetch, which is many pipeline stages earlier. By the time fetch gets redirected after decoding instruction N there are already many subsequent instructions that were fetched and decoded on the wrong (not taken) path. Those have to be thrown away. We say that a bubble is introduced in the front-end.

The upshot of all of this is that even if static prediction is 100% correct, it is very inefficient in the taken branch case since the front-end pipelining is defeated. If there are 6 pipeline stages between fetch and the end of decode, every taken branch causes a 6-cycle bubble in the pipeline, with the generous assumption that the prediction itself and flushing bad-path instructions take "zero cycles".

Dynamic Prediction to the Rescue

Modern x86 CPUs, however, are able to execute taken branches at up to 1 every cycle, much better than the limit even for perfectly predicted static execution. To achieve this, the predictor usually cannot use information available after decoding. It must be able to redirect fetch every cycle and use only inputs available with a latency of one cycle after the last prediction. Essentially, this means predictor is basically a self-contained process that uses only its own output as input for the next cycle's prediction.

This is the dynamic predictor on most CPUs. It predicts where to fetch from next cycle, and then based on that prediction it predicts where to fetch from the cycle after that, and so on. It doesn't use any information about the decoded instructions, but only past behavior of the branches. It does eventually get feedback from the execution units about the actual direction of the branch, and updates its predictions based on that, but this all happens essentially asynchronously, many cycles after the relevant instruction has passed through the predictor.

Adding It Up

All of this serves to chip away at the usefulness of static prediction.

First, the prediction comes too late, so even when working perfectly it implies a bubble of 6-8 cycles on modern Intel for taken branches (indeed, these are observed figures from so-called "front-end resteers" on Intel). This dramatically changes the cost/benefit equation for making a prediction at all. When you have a dynamic predictor before fetch making a prediction, you more-or-less want to make some prediction and if it has even 51% accuracy it will probably pay off.

For static predictions, however, you need to have high accuracy if you ever want to make a "taken" prediction. Consider, for example, an 8-cycle front-end resteer cost, versus a 16 cycle "full mispredict" cost. Let's say in some program that cold backwards branches are taken twice as often as not taken. This should be a win for static branch prediction that predicts backwards-taken, right (compared to a default strategy of always "predicting"² not-taken)?

Not so fast! If you assume an 8-cycle re-steer cost and a 16-cycle full mispredict cost, they end up having the same blended cost of 10.67 cycles - because even in the correctly predicted taken case where is an 8 cycle bubble, but in the fall-through case there is no corresponding cost for the no-static-prediction case.

Add to that that the no-static-prediction case already gets the other half of static prediction correct (the forward-branches not-taken case), the utility of static prediction is not as large as one would imagine.

Why the change now? Perhaps because the front-end part of the pipeline has lengthened compared to the other parts, or because the increasing performance and memory of the dynamic predictors means that fewer cold branches are eligible for static prediction at all. Improving performance of static predictors also means that the backwards-taken prediction becomes less strong for cold branches, because loops (which are the reason for the backwards-taken rule) are more frequently remembered by the dynamic predictor.

Saving Dynamic Prediction Resources

The change could also be because of an interaction with dynamic prediction: one design for a dynamic predictor is not to use any branch prediction resources at all for a branch that is never observed to be taken. Since such branches are common, this can save a lot of history table and BTB space. However, such a scheme is inconsistent with a static predictor that predicts backwards branches as taken: if a backwards branch is never taken, you don't want the static predictor to pick up this branch, and predict it as taken and so messing up your strategy of saving resources for not-taken branches.

¹ ... and also then do more stuff like retire, them - but what happens after execute mostly isn't important for our purposes here.

² I put "predicting" in scare-quotes here because in a way it's not even predicting: not-taken is the default behavior of fetch and decode in the absence of any prediction to the contrary, so it's what you get if you don't put in any static prediction at all, and your dynamic predictor doesn't tell you otherwise.