I am trying to understand when branch predictor entries are invalidated.
Here are the experiments I have done:
Code 1:

    start_measure_branch_mispred()
    while(X times):
        if(something something):
            do_useless()
        endif
    endwhile
    end_measurement()
    store_difference()
So, I am running this code a number of times. I can see that after the first run, the misprediction rate goes down; the branch predictor learns to predict correctly. But if I run the experiment again and again (i.e., by typing ./experiment in the terminal), every execution starts from a high misprediction rate. So, at each execution, the branch prediction entries for those conditional branches appear to be invalidated. I am booting with nokaslr and I have disabled ASLR. I also run this experiment on an isolated core, and I have repeated it several times to make sure this behavior is consistent (i.e., not caused by noise).
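For concreteness, the setup I mean looks roughly like this (the core numbers are examples, and isolcpus is a boot parameter, so it has to be set on the kernel command line before boot):

```shell
# Disable userspace ASLR system-wide (nokaslr on the kernel command line
# additionally disables kernel ASLR)
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

# Cores isolated with e.g. isolcpus=5,11 show up here after reboot
cat /sys/devices/system/cpu/isolated

# Run the experiment pinned to an isolated core (the program also pins itself)
taskset -c 5 ./experiment 5
```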
My question is: does the CPU invalidate branch prediction entries after a program stops executing? Or, if not, what is the cause of this?
The second experiment I have done is:
Code 2:

    do:
        start_measure_branch_mispred()
        while(X times):
            if(something something):
                do_useless()
            endif
        endwhile
        end_measurement()
        store_difference()
    while(cpu core == 1)
In this experiment, I run two processes from two different terminals. The first one is pinned to core 1, so that it runs on core 1 and repeats the experiment until I stop it (by killing it). Then I run the second process from another terminal, pinning it to different cores. Since this process is on a different core, it executes the do-while loop only once. If the second process is pinned to the sibling of the first one's core (the same physical core), I see that in its first iteration the second process already predicts almost correctly. If I pin the second process to a core that is not the sibling of the first one, its first iteration produces many more mispredictions. This is the expected result, because virtual cores on the same physical core share the same branch prediction units (that is my assumption). So, the second process benefits from the trained branch prediction entries, since the branches have the same virtual addresses and map to the same branch prediction unit entries.
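Rather than assuming which logical cores are siblings, the mapping can be read from sysfs (cpu0 below is just an example; substitute the core of interest):

```shell
# Two IDs in this file mean those logical CPUs share one physical core,
# and therefore (on SMT parts) the same front-end resources.
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
```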
As far as I understand, since the CPU is not done with the first process (the core-1 process doing the busy loop), the branch prediction entries are still there, and the second process can benefit from them. But in the first experiment, from run to run, I get higher mispredictions.
EDIT: As another user asked for the code, here it is. You need to download the performance-events header code from here.
To compile: $(CXX) -std=c++11 -O0 main.cpp -lpthread -o experiment
The code:
#include "linux-perf-events.h"
#include <algorithm>
#include <cerrno>
#include <climits>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <vector>
// some array
int arr8[8] = {1,1,0,0,0,1,0,1};
int pin_thread_to_core(int core_id){
    int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (core_id < 0 || core_id >= num_cores)
        return EINVAL; // reject out-of-range cores instead of falling through
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}
void measurement(int cpuid, uint64_t howmany, int* branch_misses){
    int retval = pin_thread_to_core(cpuid);
    if(retval){
        printf("Affinity error: %s\n", strerror(retval));
        return;
    }
    std::vector<int> evts;
    evts.push_back(PERF_COUNT_HW_BRANCH_MISSES); // You might have a different performance event!
    LinuxEvents<PERF_TYPE_HARDWARE> unified(evts, cpuid); // You need to change the constructor in the performance counter so that it will count the events on the given cpuid
    uint64_t *buffer = new uint64_t[howmany + 1];
    uint64_t *buffer_org = buffer;   // for restoring
    uint64_t howmany_org = howmany;  // for restoring
    std::vector<unsigned long long> results;
    results.resize(evts.size());
    do{
        for(size_t trial = 0; trial < 10; trial++) {
            unified.start();
            // the while loop will be executed howmany times
            int res;
            while(howmany){
                res = arr8[howmany & 0x7]; // do the sequence howmany/8 times
                if(res){
                    *buffer++ = res;
                }
                howmany--;
            }
            unified.end(results);
            // store misses
            branch_misses[trial] = results[0];
            // restore for the next iteration
            buffer = buffer_org;
            howmany = howmany_org;
        }
    }while(cpuid == 5); // the core that does the busy loop
    // defeat the optimizer
    howmany = (howmany + 1) * buffer[3];
    branch_misses[10] = howmany; // last entry is reserved for this dummy operation
    delete[] buffer;
}
void usage(){
    printf("Run with ./experiment X \t where X is the core number\n");
}

int main(int argc, char *argv[]) {
    // as I have the 11th core isolated, set affinity to that
    if(argc == 1){
        usage();
        return 1;
    }
    int exp = 16; // howmany
    int results[11];
    int cpuid = atoi(argv[1]);
    measurement(cpuid, exp, results);
    printf("%d measurements\n", exp);
    printf("Trial\t\t\tBranchMiss\n");
    for (size_t trial = 0; trial < 10; trial++)
    {
        printf("%zu\t\t\t%d\n", trial, results[trial]);
    }
    return 0;
}
If you want to try the first code, just run ./experiment 1 twice. It will have the same execution as the first code.
If you want to try the second code, open two terminals, run ./experiment X in the first one, and run ./experiment Y in the second one, where X and Y are cpuids.
Note that you might not have the same performance event counter. Also note that you might need to change the cpuid in the busy loop (the cpuid == 5 check in measurement()), and change const int cpu = -1; in the performance-events header to a different core.
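Since the linked header is external, here is a minimal self-contained sketch of what the counting boils down to, using raw perf_event_open directly (Linux-specific; the helper names are mine, and the function returns -1 if the counter cannot be opened, e.g. under a restrictive perf_event_paranoid setting):

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Thin wrapper: glibc provides no perf_event_open() symbol, so we go
// through syscall(2) as the man page suggests.
static int open_branch_miss_counter() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_BRANCH_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    // pid = 0 (this thread), cpu = -1 (whichever CPU the thread runs on)
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

// Counts branch misses over the same kind of data-dependent loop as the
// experiment above. Returns -1 if the counter could not be opened or read.
long long count_branch_misses(uint64_t howmany) {
    int fd = open_branch_miss_counter();
    if (fd == -1)
        return -1;
    static const int arr8[8] = {1, 1, 0, 0, 0, 1, 0, 1};
    volatile int sink = 0; // keeps the branch body from being optimized away
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    while (howmany) {
        if (arr8[howmany & 0x7])
            sink = sink + 1;
        howmany--;
    }
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long misses = -1;
    if (read(fd, &misses, sizeof(misses)) != (ssize_t)sizeof(misses))
        misses = -1;
    close(fd);
    return misses;
}
```

Calling count_branch_misses(howmany) repeatedly from one process should show the same trial-to-trial drop as the full program, without needing the external header.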