Sometimes the term "BTB" is used collectively to refer to all of the buffers used by the branch prediction unit. However, there are actually multiple buffers, all of which are used in every cycle to make target and direction predictions. In particular, the BTB is used to make predictions for direct branches, the ITB (indirect target buffer) is used to make predictions for indirect branches other than returns, and the RSB is used to make predictions for returns. The ITB is also called the IBTB or the Indirect Target Array; all of these names are used by different vendors and researchers. Typically, the BTB is used to make initial predictions for all kinds of branch instructions when the other buffers miss, but as the predictor learns more about the branches, the other buffers come into play. If all dynamic instances of the same indirect branch have the same target, then the BTB might also be used instead of the ITB. The ITB is much more accurate when the same branch has multiple targets and is designed specifically to deal with such branches (see: Branch prediction and the performance of interpreters — Don't trust folklore). The first Intel processor that implemented separate BTB and ITB structures was the Pentium M. All later Intel Core processors have dedicated ITBs.
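To see why a dedicated multi-target structure helps, consider the kind of indirect branch an interpreter produces: a single call or jump site whose target changes almost every time it executes. Here is a minimal, hypothetical C sketch (the handler names and opcode encoding are made up) in which one indirect call site has three possible targets; a single-target BTB entry mispredicts it whenever consecutive opcodes differ, which is exactly the case the ITB is built for.

```c
#include <stddef.h>
#include <stdio.h>

/* Toy interpreter: one indirect call site whose target changes from one
 * iteration to the next depending on the opcode.  A plain BTB entry can
 * only remember one target for this call site, so it mispredicts whenever
 * consecutive opcodes differ; a multi-target predictor (ITB/IBTB) can
 * track several targets for the same branch. */

typedef void (*handler_t)(long *acc, long arg);

static void op_add(long *acc, long arg) { *acc += arg; }
static void op_sub(long *acc, long arg) { *acc -= arg; }
static void op_mul(long *acc, long arg) { *acc *= arg; }

static handler_t handlers[] = { op_add, op_sub, op_mul };

int main(void) {
    unsigned char program[] = { 0, 2, 1, 0, 2 };   /* opcode stream */
    long args[]             = { 3, 4, 5, 6, 7 };
    long acc = 1;

    for (size_t i = 0; i < sizeof program; i++)
        handlers[program[i]](&acc, args[i]);       /* the indirect branch */

    printf("%ld\n", acc);
    return 0;
}
```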
The Spectre V1 exploit is based on training the direction predictor using an attacker program so that when the victim executes a conditional branch that aliases the same predictor entry, the processor is tricked into speculatively executing instructions (called the gadget) that leak information. The Spectre V2 exploit is similar, but is based on training the ITB instead. The crucial difference is that in V1, the processor mispredicts the direction of the branch, while in V2, the processor mispredicts the target of the branch (and, in the case of a conditional indirect branch, the direction as well, because we want it to be taken). In programs that are interpreted, JIT-compiled, or make use of dynamic polymorphism, there can be many indirect branches (other than returns). A particular indirect branch may never be intended to go to some location, but by mistraining the predictor, it can be made to jump anywhere we want. This is exactly why V2 is so powerful: no matter where the gadget is and no matter what the program's intended control flow is, you can pick one of the indirect branches and make it speculatively jump to the gadget.
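The following is a hypothetical C sketch of the code shapes involved in a V2-style scenario (it is not a working exploit; all names are made up, and both the mistraining code and the cache-timing measurement are omitted). Architecturally, the indirect call in `do_request` only ever targets `handle_ping` or `handle_echo`, yet if the indirect target predictor has been mistrained to point that call site at `gadget`, the gadget can run speculatively and leave secret-dependent state in the cache.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical victim code shapes for a Spectre-V2-style scenario.
 * All names are made up; mistraining and the cache side channel are
 * deliberately omitted. */

static uint8_t secret[64] = "hypothetical secret data";
static uint8_t probe_array[256 * 4096];   /* observable via cache timing */

/* The "gadget": never reached architecturally on this path, but if it
 * runs speculatively it encodes one secret byte into the cache state. */
void gadget(size_t idx) {
    volatile uint8_t sink = probe_array[secret[idx] * 4096];
    (void)sink;
}

typedef void (*handler_t)(size_t);
static void handle_ping(size_t x) { (void)x; }
static void handle_echo(size_t x) { (void)x; }
static handler_t table[2] = { handle_ping, handle_echo };

/* The victim's indirect branch: if the indirect target predictor has been
 * mistrained, this call site may be predicted to jump to gadget(). */
void do_request(unsigned kind, size_t arg) {
    table[kind & 1](arg);
}

int main(void) {
    do_request(0, 5);   /* architecturally, only handle_ping/handle_echo run */
    return 0;
}
```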
Note that typically the linear address of the target of a static direct branch remains the same throughout the lifetime of the program. There is only one situation where this may not be the case: dynamic code modification. So at least in theory, a Spectre exploit can be developed based on target misprediction of direct branches.
Regarding reclamation of LFBs, I don't really understand what you're saying. When a load request that missed the L1D receives the data into the LFB, the data is immediately forwarded to the bypass interconnect of the pipeline. There needs to be a way to determine which load uop has requested this data, so the returned data must be tagged with the uop ID of the load. The sources of the uops in the RS that are waiting for the data are represented as the uop IDs of the loads. In addition, the ROB entry that holds the load uop needs to be marked as completed so that it can be retired, and, in pre-SnB, the returned data needs to be written into the ROB.

If, on a pipeline flush, an outstanding load request in an LFB is not cancelled, and the load uop ID gets reused for some other uop, then when the data arrives, it might be incorrectly forwarded to whatever new uop is currently in the pipeline, thereby corrupting the microarchitectural state. So there needs to be a way to ensure that this cannot happen under any circumstances. It's entirely possible to cancel outstanding load requests and speculative RFOs on a pipeline flush by simply marking all of the valid LFB entries as "cancelled", just so that the data is not returned to the pipeline; the data might still be fetched and filled into one or more levels of the caches. Requests in the LFB are identified by line-aligned physical addresses. Other designs are possible.
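To make that concrete, here is a conceptual C sketch of one way such a scheme could work; it is not Intel's actual design, and the fields, sizes, and function names are all assumptions. The point is that a flush does not have to kill the memory request itself: the entry just stops being allowed to forward its data to the pipeline, and is reclaimed only when the data returns.

```c
#include <stdbool.h>
#include <stdint.h>

/* Conceptual model only, not Intel's actual implementation.  A pipeline
 * flush leaves the requests outstanding (so the fill into the caches can
 * still happen), but marks them "cancelled" so that the returned data is
 * never forwarded to the pipeline, where the load uop ID may already
 * have been reused by a younger uop. */

#define NUM_LFBS 10   /* assumed number of L1D fill buffers */

typedef struct {
    bool     valid;       /* entry holds an outstanding request            */
    bool     cancelled;   /* request belongs to a flushed (wrong-path) load */
    uint64_t line_paddr;  /* line-aligned physical address of the request   */
    uint16_t load_uop_id; /* which load uop asked for this data             */
} lfb_entry_t;

static lfb_entry_t lfb[NUM_LFBS];

/* On a pipeline flush: keep the requests outstanding, but make sure their
 * data can no longer be forwarded to (possibly reused) uop IDs. */
void on_pipeline_flush(void) {
    for (int i = 0; i < NUM_LFBS; i++)
        if (lfb[i].valid)
            lfb[i].cancelled = true;
}

/* When data returns from the uncore: the cache fill can proceed either
 * way, but only a non-cancelled entry wakes up the requesting load.
 * The entry is reclaimed only at this point. */
void on_data_return(int i) {
    /* fill_caches(lfb[i].line_paddr, ...);  -- fill happens regardless */
    if (!lfb[i].cancelled) {
        /* forward data to the bypass network tagged with load_uop_id,
         * mark the load's ROB entry as completed, etc. */
    }
    lfb[i].valid = false;
}
```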
I've decided to run an experiment to determine exactly when the LFBs get deallocated on Haswell. Here is how it works:
```
Outer Loop (10K iterations):
    Inner Loop (100 iterations):
        10 load instructions to different cache lines, most of which miss the L2.
        LFENCE.
        A sequence of IMULs to delay the resolution of the jump by 18 cycles.
        Jump to inner.
    3 load instructions to different cache lines.
    LFENCE.
    Jump to outer.
```
For this to work, hyperthreading and both L1 prefetchers need to be turned off to ensure that we own all of the 10 LFBs of the L1.
The `LFENCE` instructions ensure that we don't run out of LFBs when executing on a correctly predicted path. The key idea here is that the inner jump will be mispredicted once per outer iteration, so up to 10 loads of the inner iteration that are on the mispredicted path can be allocated in the LFBs. Note that the `LFENCE` prevents loads from later iterations from being allocated. After a few cycles, the inner branch is resolved and the misprediction is detected. The pipeline is cleared and the frontend is resteered to fetch and execute the load instructions in the outer loop.
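For reference, here is roughly what the loop structure looks like, sketched in C with GCC-style inline assembly. This is a sketch rather than the exact code: the buffer size, strides, and the way fresh cache lines are chosen are assumptions, the IMUL chain is 6 dependent multiplies (about 3 cycles each on Haswell, so about 18 cycles), and `L1D_PEND_MISS.FB_FULL` is assumed to be measured with an external performance-counter tool while SMT and the L1 prefetchers are disabled.

```c
#include <stdint.h>
#include <stddef.h>

#define OUTER 10000
#define INNER 100

/* Large buffer so that by the time the pointer wraps around, previously
 * touched lines have long been evicted from the L2/L3 (size is an assumption). */
static uint8_t buf[128u * 1024 * 1024];

void experiment(void) {
    uint8_t *p = buf;
    for (int i = 0; i < OUTER; i++) {
        /* Each outer iteration walks over INNER*10 + 13 fresh cache lines;
         * wrap before running out of buffer. */
        if ((size_t)(p - buf) + (INNER * 10 + 13) * 64 > sizeof buf)
            p = buf;
        uint64_t k = INNER;
        asm volatile(
            ".p2align 4\n"
            "1:\n\t"
            /* 10 loads to 10 different cache lines (mostly L2 misses) */
            "movzbl 0*64(%[p]), %%eax\n\t"
            "movzbl 1*64(%[p]), %%eax\n\t"
            "movzbl 2*64(%[p]), %%eax\n\t"
            "movzbl 3*64(%[p]), %%eax\n\t"
            "movzbl 4*64(%[p]), %%eax\n\t"
            "movzbl 5*64(%[p]), %%eax\n\t"
            "movzbl 6*64(%[p]), %%eax\n\t"
            "movzbl 7*64(%[p]), %%eax\n\t"
            "movzbl 8*64(%[p]), %%eax\n\t"
            "movzbl 9*64(%[p]), %%eax\n\t"
            "lfence\n\t"
            "add $640, %[p]\n\t"              /* next iteration uses 10 fresh lines */
            /* 6 dependent IMULs (~3 cycles each) delay the jump by ~18 cycles */
            "imul $1, %[k], %[k]\n\t"
            "imul $1, %[k], %[k]\n\t"
            "imul $1, %[k], %[k]\n\t"
            "imul $1, %[k], %[k]\n\t"
            "imul $1, %[k], %[k]\n\t"
            "imul $1, %[k], %[k]\n\t"
            "sub $1, %[k]\n\t"
            "jnz 1b\n\t"                      /* mispredicted on the last inner iteration */
            /* 3 loads on the correct (fall-through) path, to lines not
             * touched by the wrong-path loads */
            "movzbl 10*64(%[p]), %%eax\n\t"
            "movzbl 11*64(%[p]), %%eax\n\t"
            "movzbl 12*64(%[p]), %%eax\n\t"
            "lfence\n\t"
            : [p] "+r" (p), [k] "+r" (k)
            :
            : "eax", "memory", "cc");
        p += 13 * 64;   /* skip past the lines touched in this outer iteration */
    }
}
```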
There are two possible outcomes:
- The LFBs that have been allocated for the loads on the mispredicted path are immediately released as part of the pipeline clear operation and made available for other loads. In this case, there will be no stalls due to LFB unavailability (counted using `L1D_PEND_MISS.FB_FULL`).
- The LFBs are released only when the loads get serviced irrespective of whether they were on a mispredicted path.
When there are three loads in the outer loop after the inner jump, the measured value of `L1D_PEND_MISS.FB_FULL` is about equal to the number of outer iterations, i.e., one event per outer loop iteration. This means that when the three loads on the correct path get issued to the L1D, the loads from the mispredicted path are still occupying 8 of the LFB entries, resulting in an FB-full event for the third load. This suggests that an LFB is only deallocated when its load actually completes.
If I put fewer than two loads in the outer loop, there are basically no FB-full events. One thing I noticed: for every additional load in the outer loop beyond three, `L1D_PEND_MISS.FB_FULL` increases by about 20K instead of the expected 10K. I think what's happening is that when a load request of a load uop gets issued to the L1D for the first time and all LFBs are in use, it gets rejected. Then, when an LFB becomes available, two loads pending in the load buffer are sent to the L1D: one is allocated an LFB and the other gets rejected, so we get two FB-full events per additional load. However, when there are three loads in the outer loop, only the third one is waiting for an LFB, so we get one event per outer loop iteration. Essentially, the load buffer cannot distinguish between one LFB being available and two; it only knows that at least one LFB is free, and so it attempts to send two load requests at the same time, since there are two load ports.