The following code fragment creates a function (fun) consisting of a single RET instruction. The loop repeatedly calls the function and, after each return, overwrites the RET instruction with the same byte.
#include <stdio.h>
#include <sys/mman.h>
typedef void (*foo)(void);
#define RET (0xC3)
int main(void) {
    // Allocate one readable, writable, executable page.
    // MAP_ANONYMOUS mappings should pass fd = -1 for portability.
    char *ins = (char *)mmap(0, 4096, PROT_EXEC | PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ins == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    // Just write a RET instruction
    *ins = (char)RET;
    // Make fun point to the function with just the RET instruction
    foo fun = (foo)(ins);
    // Repeat 0xfffffff times
    for (long i = 0; i < 0xfffffff; i++) {
        fun();
        // Overwrite the instruction with the same byte after returning
        *ins = (char)RET;
    }
    return 0;
}
On an x86 Broadwell machine, Linux perf reports the following L1 i-cache and iTLB statistics:
perf stat -e L1-icache-load-misses -e iTLB-load-misses ./a.out
Performance counter stats for './a.out':
805,516,067 L1-icache-load-misses
4,857 iTLB-load-misses
32.052301220 seconds time elapsed
Now, look at the same code without overwriting the RET instruction.
#include <stdio.h>
#include <sys/mman.h>
typedef void (*foo)(void);
#define RET (0xC3)
int main(void) {
    // Allocate one readable, writable, executable page.
    // MAP_ANONYMOUS mappings should pass fd = -1 for portability.
    char *ins = (char *)mmap(0, 4096, PROT_EXEC | PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ins == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    // Just write a RET instruction
    *ins = (char)RET;
    // Make fun point to the function with just the RET instruction
    foo fun = (foo)(ins);
    // Repeat 0xfffffff times
    for (long i = 0; i < 0xfffffff; i++) {
        fun();
        // Store commented out: *ins = RET;
    }
    return 0;
}
And here are the perf statistics on the same machine:
perf stat -e L1-icache-load-misses -e iTLB-load-misses ./a.out
Performance counter stats for './a.out':
11,738 L1-icache-load-misses
425 iTLB-load-misses
0.773433500 seconds time elapsed
Notice that overwriting the instruction causes L1-icache-load-misses to grow from 11,738 to 805,516,067, roughly a 68,600-fold increase. iTLB-load-misses also grows, from 425 to 4,857 (about 11x), which is substantial but far smaller than the growth in L1-icache-load-misses. The running time grows from 0.773433500 seconds to 32.052301220 seconds, a 41x slowdown!
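To make the comparison concrete, here is a small sketch that recomputes the ratios from the perf numbers quoted above. The helper names (growth, per_iteration, report) are made up for illustration. The only inputs are the counter values and the loop count 0xfffffff from the question.

```c
#include <stdio.h>

/* Ratio of a counter with the overwriting store to the same counter without it. */
static double growth(double with_store, double without_store) {
    return with_store / without_store;
}

/* Average number of events per loop iteration (i.e. per call to fun). */
static double per_iteration(double events, double iterations) {
    return events / iterations;
}

/* Print the ratios for the numbers reported in the question. */
static void report(void) {
    const double iterations = (double)0xfffffff; /* loop count from the question */

    printf("L1i miss growth:   %.0fx\n", growth(805516067.0, 11738.0));
    printf("iTLB miss growth:  %.1fx\n", growth(4857.0, 425.0));
    printf("run time growth:   %.1fx\n", growth(32.052301220, 0.773433500));
    printf("L1i misses per call (with store): %.2f\n",
           per_iteration(805516067.0, iterations));
}
```

Calling report() from main prints the ratios. The last line is the interesting one: with the store in place, the reported count works out to about three L1 i-cache misses per call, so the cost is paid on every iteration rather than being a warm-up effect.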
It is unclear why the CPU should incur i-cache misses when the instruction footprint is so small. The only difference between the two examples is that the instruction is modified. Granted, the L1 i-cache and d-cache are separate, but isn't there a way to install code into the i-cache so that these misses can be avoided?
Furthermore, why is there a roughly 11x growth in the iTLB misses?
machine_clears.smc even with mfence + lfence after the store on Skylake. I was hoping that would stop speculation into the code in another page until after the store could evict the uop-cache and L1i entries. – Peter Cordes