1
votes

Why is branch prediction accurate? Can we generally think of it at a high level in terms of how certain branches of our code execute 99% of the time, while the rest is special cases and exception handling?

My question may be a little vague, but I am only interested in a high-level view on this. Let me give you an example.

Say you have a function with a parameter

void execute(Input param) { 
  assertNotEmpty(param);
  // ... main body, only reached when param is non-empty
}

I execute my function conditionally, given that the parameter isn't empty. 99% of the time this parameter will indeed be non-empty. Can I then think of neural-network-based branch prediction, for example, in such a way that, having seen this instruction flow countless times (such assertions are quite common), it will simply learn that most of the time the parameter is non-empty and take the branch accordingly?

Can we then think of our code in these terms: the cleaner, more predictable, or simply more common we make it, the easier we make things for the branch predictor?

Thanks!


3 Answers

2
votes

There are a couple of reasons that allow us to build good branch predictors:

  1. Bi-modal distribution - the outcome of branches is often bimodally distributed, i.e. an individual branch is often highly biased towards taken or not taken. If the outcomes of most branches were uniformly distributed, it would be impossible to devise a good prediction algorithm.

  2. Dependency between branches - in real-world programs, there is a significant amount of dependency between distinct branches, that is, the outcome of one branch affects the outcome of another branch. For example:

    if (var1 == 3)     // b1
        var1 = 0;
    if (var2 == 3)     // b2
        var2 = 0;
    if (var1 != var2)  // b3
        ...
    

    The outcome of branch b3 here depends on the outcome of branches b1 and b2. If both b1 and b2 are not taken (that is, their conditions evaluate to true and var1 and var2 are assigned 0), then branch b3 will be taken. A predictor that looks only at a single branch has no way to capture this behavior. Algorithms that examine this inter-branch behavior are called two-level (or correlating) predictors.

You didn't ask for any particular algorithms, so I won't describe any of them, but I'll mention the 2-bit prediction buffer scheme, which works reasonably well and is quite simple to implement (essentially, one keeps track of the outcomes of a particular branch in a small cache and makes a decision based on the current state of that entry). This scheme was implemented in the MIPS R10000 processor and the results showed prediction accuracy of ~90%.
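A minimal sketch of such a scheme, assuming a small table of 2-bit saturating counters indexed by the low bits of the branch address (illustrative C only; the table size is made up and real predictors are hardware structures, not software):

    #include <stdint.h>

    /* Each entry is a 2-bit saturating counter: 0,1 predict "not taken", 2,3 predict "taken". */
    #define ENTRIES 1024                               /* hypothetical size, a power of two */
    static unsigned char counters[ENTRIES];            /* each value stays in 0..3 */

    int predict(uintptr_t branch_addr) {
        return counters[branch_addr & (ENTRIES - 1)] >= 2;   /* 1 = predict taken */
    }

    void update(uintptr_t branch_addr, int taken) {
        unsigned char *c = &counters[branch_addr & (ENTRIES - 1)];
        if (taken  && *c < 3) (*c)++;                  /* strengthen towards "taken" */
        if (!taken && *c > 0) (*c)--;                  /* strengthen towards "not taken" */
    }

The 2-bit hysteresis means a single odd outcome (e.g. the last iteration of a loop) does not immediately flip the prediction.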

I'm not sure about the application of NNs to branch prediction - it does seem possible to design an algorithm based on NNs. However, I believe it wouldn't have any practical use, as: a) it would be too complex to implement in hardware (it would take too many gates and introduce a lot of delay); b) it wouldn't improve the predictor's performance significantly compared to traditional algorithms that are much easier to implement.

2
votes

A short history of how branches are predicted:

When Great-Granny was programming

there was no prediction and no prefetch. Soon she started prefetching the next instruction while executing the current instruction. Most of the time this was correct and improved the clocks per instruction in most cases by one, and otherwise nothing was lost. This already had an average misprediction rate of only 34% (9%-59%, H&P AQA p. 81).

When Granny was programming

there was the problem that CPUs were getting faster, and a decode stage was added to the pipeline, making it Fetch -> Decode -> Execute -> Write back. With about 5 instructions between branches, 2 fetches were lost every 5 instructions whenever a branch went the other way than the simple scheme assumed. A quick investigation showed that most conditional backward branches are loops and are mostly taken, while most forward branches are mostly not taken, as they tend to guard the bad cases. With profiling this gets the misprediction rate down to 3%-24%.
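As an aside, that static heuristic ("backward taken, forward not taken") boils down to a single comparison; a rough C sketch, purely for illustration:

    #include <stdint.h>

    /* Static prediction: backward branches (usually loop-closing) are assumed taken,
       forward branches (often error/exception paths) are assumed not taken. */
    int predict_static(uintptr_t branch_addr, uintptr_t target_addr) {
        return target_addr < branch_addr;              /* 1 = predict taken */
    }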

The advent of the dynamic branch predictor with the saturating counter

made life easier for the programmer. It builds on the observation that most branches do what they did last time: a table of saturating counters, indexed with the low bits of the branch's address, tells whether the branch was taken or not, and the Branch Target Buffer provides the address to be fetched. This local predictor lowers the misprediction rate to 1%-18%.
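The Branch Target Buffer itself can be pictured as a tiny cache from branch address to last seen target; a toy sketch with made-up sizes, purely for illustration:

    #include <stdint.h>
    #include <stddef.h>

    #define BTB_ENTRIES 1024                           /* hypothetical size, a power of two */
    static uintptr_t btb_tag[BTB_ENTRIES];             /* branch address, used as a hit check */
    static uintptr_t btb_target[BTB_ENTRIES];          /* last seen target for that branch */

    uintptr_t btb_lookup(uintptr_t branch_addr) {
        size_t i = branch_addr & (BTB_ENTRIES - 1);
        return btb_tag[i] == branch_addr ? btb_target[i] : 0;   /* 0 = no prediction */
    }

    void btb_update(uintptr_t branch_addr, uintptr_t actual_target) {
        size_t i = branch_addr & (BTB_ENTRIES - 1);
        btb_tag[i] = branch_addr;
        btb_target[i] = actual_target;
    }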

This is all good and fine, but some branches depend on how other, previous branches behaved. So if we keep a history of the last H branches as taken/not taken (1 and 0), we get 2^H different predictors depending on that history. In practice the history bits are XOR'ed with the branch's lower address bits, using the same counter array as in the previous version; see the sketch below.
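A rough sketch of that indexing (a gshare-style predictor), again with made-up sizes and purely for illustration:

    #include <stdint.h>
    #include <stddef.h>

    #define GENTRIES 4096                              /* hypothetical size, a power of two */
    static unsigned char gcounters[GENTRIES];          /* shared array of 2-bit counters (0..3) */
    static unsigned ghist;                             /* last branch outcomes as a bit string */

    static size_t gindex(uintptr_t branch_addr) {
        return (branch_addr ^ ghist) & (GENTRIES - 1); /* xor history with low address bits */
    }

    int predict_global(uintptr_t branch_addr) {
        return gcounters[gindex(branch_addr)] >= 2;    /* 1 = predict taken */
    }

    void update_global(uintptr_t branch_addr, int taken) {
        unsigned char *c = &gcounters[gindex(branch_addr)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghist = ((ghist << 1) | (taken & 1)) & (GENTRIES - 1);  /* shift the outcome into the history */
    }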

The PRO of this is that the predictor can quickly learn patterns; the CON is that if there is no pattern, a branch will overwrite the bits of previous branches. The PRO outweighs the CON, as locality is more important than branches that are not in the current (inner) loop. This global predictor brings the misprediction rate down to 1%-11%.

That is great, but in some cases the local predictor beats the global predictor, so we want both. XOR-ing the local branch history with the address improves the local branch prediction, making it a two-level predictor as well, just with local instead of global branch history. Adding a third saturating counter for each branch, which counts which of the two was right, lets us select between them. This tournament predictor improves the misprediction rate by around 1 percentage point compared with the global predictor.
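The chooser can be sketched as one more table of 2-bit counters that is only nudged when the two predictors disagree (a toy illustration with hypothetical names and sizes, not how any particular CPU implements it):

    #include <stdint.h>
    #include <stddef.h>

    #define CH_ENTRIES 4096                            /* hypothetical size, a power of two */
    static unsigned char chooser[CH_ENTRIES];          /* 0,1 = prefer local; 2,3 = prefer global */

    int predict_tournament(uintptr_t branch_addr, int local_pred, int global_pred) {
        size_t i = branch_addr & (CH_ENTRIES - 1);
        return chooser[i] >= 2 ? global_pred : local_pred;
    }

    void update_chooser(uintptr_t branch_addr, int local_pred, int global_pred, int taken) {
        size_t i = branch_addr & (CH_ENTRIES - 1);
        if (local_pred == global_pred) return;         /* learn only when they disagree */
        if (global_pred == taken && chooser[i] < 3) chooser[i]++;
        if (local_pred  == taken && chooser[i] > 0) chooser[i]--;
    }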

Now your case is one branch in 100 going in the other direction.

Let's examine the local two-level predictor. When we get to the one case, the last H outcomes of this branch have all been in the same direction, let's say taken, making the history all 1's, so the branch predictor will have selected a single entry in the local predictor table, and that entry will be saturated towards taken. This means it will mis-predict the one case every time, and the next call, where the branch is taken again, will most likely be predicted correctly (barring aliasing of the branch table entry). So the local branch predictor can't catch this case, as a 100-bit-long history would require a 2^100-entry predictor.

Maybe the global predictor catches the case, then. In the last 99 cases the branch was taken, so the entries selected by the varying behaviour of the last H branches have all moved towards predicting taken. So if the last H branches behave independently of the current branch, then all the entries reached in the global branch prediction table will predict taken, and you will get a mis-predict.

But if a combination of previous branches, say the 3rd, 7th and 12th, acts so that the right combination of taken/not-taken foreshadows the opposite behaviour, then the branch prediction entry for that combination will correctly predict the behaviour of the branch. The problem is that if that entry is updated only seldom over the runtime of the program, and other branches alias to it with their own behaviour, it might fail to predict anyway.

Let's assume the global predictor actually predicts the right outcome based on the pattern of previous branches. Then you will most likely be misled by the tournament predictor, which says the local predictor is "always" right, while the local predictor will always mis-predict your case.

Note 1: The "always" should be taken with a grain of salt, as other branches might pollute your branch table entries by aliasing to the same entry. The designers have tried to make this less likely by having 8K different entries and by creatively rearranging the bits of the lower address of the branch.

Note 2: Other schemes might be able to solve this, but it is unlikely, as it is 1 case in 100.

0
votes

Many languages provide mechanisms to tell the compiler which branch is the expected result. This helps the compiler organise the code to maximise correct branch predictions. Examples: gcc's __builtin_expect and the likely/unlikely macros built on top of it.
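A small sketch of how this is commonly used in C with gcc/clang (the likely/unlikely names are the usual convention, e.g. in the Linux kernel, not part of the language; the empty-check is just an assumed example mirroring the question):

    #include <stddef.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    void execute(const char *param) {
        if (unlikely(param == NULL || param[0] == '\0')) {
            /* rare path: the compiler can lay this out away from the hot code */
            return;
        }
        /* common path (99% of calls): laid out to fall through, which also suits
           static schemes that assume forward branches are not taken */
    }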