19
votes

From here I know Intel implemented several static branch prediction mechanisms these years:

  • 80486 age: Always-not-taken

  • Pentium4 age: Backwards Taken/Forwards Not-Taken

  • Newer CPUs like Ivy Bridge, Haswell have become increasingly intangible, see Matt G's experiment here.

And Intel seems don't want to talk about it any more, because the latest material I found within Intel Document was written about ten years ago.

I know static branch prediction is (far?) less important than dynamic, but in quite a few situations, CPU will be completely lost and programmers(with compiler) are usually the best guide. Of course these situations are usually not performance bottleneck, because once a branch is frequently executed, the dynamic predictor will capture it.

Since Intel no longer clearly statements the dynamic prediction mechanism in its document, the builtin_expect() of GCC can do nothing more than removing the unlikely branch from hot path.

I am not familiar with CPU design and I don't know what exactly mechanism Intel use nowadays for its static predictor, but I still feel the best mechanism for Intel should be to clearly document his CPU 'where I plan to go when dynamic predictor failed, forward or backward', because usually the programmer is the best guide at that time.

Update:
I found the topics you mentioned gradually go beyond my knowledge. Some dynamic prediction mechanism and CPU internal details are involved here which I can't learn within two or three days. So allow me quit your discussion temporarily and recharge.
Any answer is still welcome here, maybe will help more people

3

3 Answers

18
votes

The primary reason why static prediction is not favored in modern designs, to the point of perhaps not even being present, is that static predictions occur too late in the pipeline compared to dynamic predictions. The basic issue is that branch directions and target locations must be known before fetching them, but static predictions can only be made after decode (which comes after fetch).

In more detail...

CPU Pipelining

Briefly, during execution needs to fetch instructions from memory, decode those instructions and then execute them1. On a high-performance CPU, these stages will be pipelined, meaning that they will all generally be happening in parallel - but for different instructions at any given moment. You could read a bit about this on Wikipedia, but keep in mind that modern CPUs are more complex, generally with many more stages.

On a modern x86, with a complex-to-decode variable-length instruction set, there may be many pipeline "stages" involved simply in fetching and decoding instructions, perhaps a half-dozen or more. Such instructions are also superscalar, capable of executing several instructions at once. This implies that when executing at peak efficiency, there will be many instructions in flight, in various stages of being fetched, decoded, executed and so on.

Redirecting Fetch

The effect of a taken branch is felt on the entire initial portion (usually called the front-end) of the pipeline: when you jump to a new address, you need to fetch from that new address, decode from that new address, etc. We say that a taken branch needs to redirect fetch. This puts certain restrictions on the information that branch prediction can use in order to perform efficiently.

Consider how static prediction works: it looks at the instruction and if it is a branch, compares its target to see if it is "forwards" or "backwards". All this must happen largely after decoding has occurred, since that's when the actual instruction is known. However, if a branch is detected and is predicted taken (e.g., a backwards jump), the predictor needs to redirect fetch, which is many pipeline stages earlier. By the time fetch gets redirected after decoding instruction N there are already many subsequent instructions that were fetched and decoded on the wrong (not taken) path. Those have to be thrown away. We say that a bubble is introduced in the front-end.

The upshot of all of this is that even if static prediction is 100% correct, it is very inefficient in the taken branch case since the front-end pipelining is defeated. If there are 6 pipeline stages between fetch and the end of decode, every taken branch causes a 6-cycle bubble in the pipeline, with the generous assumption that the prediction itself and flushing bad-path instructions take "zero cycles".

Dynamic Prediction to the Rescue

Modern x86 CPUs, however, are able to execute taken branches at up to 1 every cycle, much better than the limit even for perfectly predicted static execution. To achieve this, the predictor usually cannot use information available after decoding. It must be able to redirect fetch every cycle and use only inputs available with a latency of one cycle after the last prediction. Essentially, this means predictor is basically a self-contained process that uses only its own output as input for the next cycle's prediction.

This is the dynamic predictor on most CPUs. It predicts where to fetch from next cycle, and then based on that prediction it predicts where to fetch from the cycle after that, and so on. It doesn't use any information about the decoded instructions, but only past behavior of the branches. It does eventually get feedback from the execution units about the actual direction of the branch, and updates its predictions based on that, but this all happens essentially asynchronously, many cycles after the relevant instruction has passed through the predictor.

Adding It Up

All of this serves to chip away at the usefulness of static prediction.

First, the prediction comes too late, so even when working perfectly it implies a bubble of 6-8 cycles on modern Intel for taken branches (indeed, these are observed figures from so-called "front-end resteers" on Intel). This dramatically changes the cost/benefit equation for making a prediction at all. When you have a dynamic predictor before fetch making a prediction, you more-or-less want to make some prediction and if it has even 51% accuracy it will probably pay off.

For static predictions, however, you need to have high accuracy if you ever want to make a "taken" prediction. Consider, for example, an 8-cycle front-end resteer cost, versus a 16 cycle "full mispredict" cost. Let's say in some program that cold backwards branches are taken twice as often as not taken. This should be a win for static branch prediction that predicts backwards-taken, right (compared to a default strategy of always "predicting"2 not-taken)?

Not so fast! If you assume an 8-cycle re-steer cost and a 16-cycle full mispredict cost, they end up having the same blended cost of 10.67 cycles - because even in the correctly predicted taken case where is an 8 cycle bubble, but in the fall-through case there is no corresponding cost for the no-static-prediction case.

Add to that that the no-static-prediction case already gets the other half of static prediction correct (the forward-branches not-taken case), the utility of static prediction is not as large as one would imagine.

Why the change now? Perhaps because the front-end part of the pipeline has lengthened compared to the other parts, or because the increasing performance and memory of the dynamic predictors means that fewer cold branches are eligible for static prediction at all. Improving performance of static predictors also means that the backwards-taken prediction becomes less strong for cold branches, because loops (which are the reason for the backwards-taken rule) are more frequently remembered by the dynamic predictor.

Saving Dynamic Prediction Resources

The change could also be because of an interaction with dynamic prediction: one design for a dynamic predictor is not to use any branch prediction resources at all for a branch that is never observed to be taken. Since such branches are common, this can save a lot of history table and BTB space. However, such a scheme is inconsistent with a static predictor that predicts backwards branches as taken: if a backwards branch is never taken, you don't want the static predictor to pick up this branch, and predict it as taken and so messing up your strategy of saving resources for not-taken branches.


1 ... and also then do more stuff like retire, them - but what happens after execute mostly isn't important for our purposes here.

2 I put "predicting" in scare-quotes here because in a way it's not even predicting: not-taken is the default behavior of fetch and decode in the absence of any prediction to the contrary, so it's what you get if you don't put in any static prediction at all, and your dynamic predictor doesn't tell you otherwise.

8
votes

Static branch prediction as discussed in Section 3.4.1.3 of the Intel Optimization Manual is as follows:

  • Predict unconditional branches to be taken.
  • Predict conditional forward branches to be not taken.
  • Predict conditional backward branches to be taken.
  • Predict indirect branches to be not taken.

Compilers can organize the code accordingly. The same section says the following:

The Intel Core microarchitecture does not use the static prediction heuristic. However, to maintain consistency across Intel 64 and IA-32 processors, software should maintain the static prediction heuristic as the default.

This statement indicates that Section 3.4.1.3 has not been updated for many years.

If the dynamic predictor failed to predict that there is a branch instruction among the bytes fetched or if it suffers a miss in its buffers, then the fetch unit will just continue fetching sequentially because there is no other meaningful choice, effectively making a static prediction of Not Taken.

However, if it turns out, in the Instruction Queue Unit, that there is a conditional or indirect branch instruction in the fetched byte stream, then it would make sense at this point to make a static prediction that is potentially better than Not Taken. In particular, predicting conditional direct backward branches Taken. This can reduce the penalty of the failure of the dynamic predictor and the Not-Taken fetch unit, especially that the performance of the frontend is so critical. To my knowledge, there is no clear statement in the optimization manual that states that there is such static predictor at the IQU and that applies to modern processors. However, as I discuss in my other answer, the desciption of some performance counters seem to imply that there can be such static predictors at the IQU.

Overall, I think this is an implementation detail that Intel no longer documents.

Compiler-assisted dynamic branch prediction techniques do exist and can be very useful as you suggested, but they are not used in current Intel processors.

4
votes

My understanding is that with current designs, modern TAGE branch direction predictors always index to an entry, using the taken/not-taken history of recent branches. (This potentially spreads the state for a single branch out over a lot of internal state, making it possible to predict very complex patterns like a 10 element BubbleSort.)

The CPU doesn't try to detect aliasing and just uses the prediction it finds to decide taken/not-taken for conditional branches. i.e. branch-direction prediction is always dynamic, never static.

But a target prediction is still needed before the branch is even decoded to keep the front-end from stalling. The Branch Target Buffer is normally tagged, because the target of some other branch that aliased is unlikely to be useful.

As @Paul A Clayton points out, a BTB miss could let the CPU decide to use static prediction instead of whatever it found in the dynamic taken / not-taken predictor. We might just be seeing that it's much harder to make the dynamic predictor miss often enough to measure static prediction.

(I might be distorting things. Modern TAGE predictors can predict complex patterns for indirect branches too, so I'm not sure if they even try to predict in terms of taken/not-taken or if the first step is always just to try to predict the next address, whether or not that's the next instruction. Indexed branch overhead on X86 64 bit mode.)


Not-taken branches are still slightly cheaper in the correctly-predicted case, because the front-end can more easily fetch earlier and later instructions in the same cycle from the uop cache. (The uop cache in Sandybridge-family is not a trace cache; a uop-cache line can only cache uops from a contiguous block of x86 machine code.) In high-throughput code, taken branches could be a minor front-end bottleneck. They also typically spread the code out over more L1i and uop-cache lines.


For indirect branches, the "default" branch-target address is still next-instruction, so it can be useful to put a ud2 or something after a jmp rax to prevent mis-speculation (especially into non-code), if you can't simply put one of the real branch targets as the next instruction. (Especially the most common one.)


Branch prediction is kind of the "secret sauce" that CPU vendors don't publish details about.

Intel actually publishes instruction throughput / latency / execution-port info themselves (through IACA and some documents), but it's fairly straightforward to test experimentally (like https://agner.org/optimize/ and http://instlatx64.atw.hu/ have done) so it's not like Intel could keep that secret even if they wanted to.

Branch-prediction success rate is easy to measure with perf counters, but knowing why one specific branch was mispredicted or not on one specific execution is very hard; even measuring is hard for a single execution of one branch, unless you instrument your code with rdtsc or rdpmc or something.