Why does modifying an instruction cause huge i-cache and i-TLB misses on x86?

Question

The following code fragment creates a function (fun) with just one RET instruction. The loop repeatedly calls the function and overwrites the contents of the RET instruction after returning.

#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

typedef void (*foo)();
#define RET (0xC3)

int main(){
    // Allocate an executable page (fd is -1, as portable MAP_ANONYMOUS mappings expect)
    char *ins = (char *) mmap(0, 4096, PROT_EXEC|PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    // Just write a RET instruction
    *ins = RET;
    // make fun point to the function with just RET instruction
    foo fun = (foo)(ins);
    // Repeat 0xfffffff times
    for(long i = 0; i < 0xfffffff; i++){
        fun();
        *ins = RET;
    }
    return 0;
}

Linux perf on an x86 Broadwell machine reports the following i-cache and iTLB statistics:

perf stat -e L1-icache-load-misses -e iTLB-load-misses ./a.out

Performance counter stats for './a.out':

   805,516,067      L1-icache-load-misses                                       
         4,857      iTLB-load-misses                                            

  32.052301220 seconds time elapsed

Now, look at the same code without overwriting the RET instruction.

#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

typedef void (*foo)();
#define RET (0xC3)

int main(){
    // Allocate an executable page (fd is -1, as portable MAP_ANONYMOUS mappings expect)
    char *ins = (char *) mmap(0, 4096, PROT_EXEC|PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    // Just write a RET instruction
    *ins = RET;
    // make fun point to the function with just RET instruction
    foo fun = (foo)(ins);
    // Repeat 0xfffffff times
    for(long i = 0; i < 0xfffffff; i++){
        fun();
        // Commented *ins = RET;
    }
    return 0;
}

And here are the perf statistics on the same machine.

perf stat -e L1-icache-load-misses -e iTLB-load-misses ./a.out

Performance counter stats for './a.out':

        11,738      L1-icache-load-misses                                       
           425      iTLB-load-misses                                            

   0.773433500 seconds time elapsed

Notice that overwriting the instruction causes L1-icache-load-misses to grow from 11,738 to 805,516,067 -- an increase of roughly 68,600x. iTLB-load-misses also grows, from 425 to 4,857 -- about 11x, which is substantial but far less than the growth in L1-icache-load-misses. The running time grows from 0.773433500 seconds to 32.052301220 seconds -- a 41x growth!

It is unclear why the CPU should incur i-cache misses when the instruction footprint is so small. The only difference between the two examples is that the instruction is modified. Granted the L1 iCache and dCache are separate, isn't there a way to install code into the iCache so that the i-cache misses can be avoided?

Furthermore, why is there a 10x growth in the iTLB misses?

Stores don't go into the L1i, so when the CPU detects SMC (self-modifying code), it invalidates the L1i line, flushes the pipeline, and restarts fetching, causing a miss. At least that's what I believe. The iTLB count may be due to some aliasing-avoidance mechanism, since there is also an identical dTLB entry. But again, I don't know for sure. -- Margaret Bloom
To learn more about how real Intel CPUs handle self-modifying code (with a pipeline nuke), see "Observing stale instruction fetching on x86 with self-modifying code". @MargaretBloom: I tested, and we do get counts for machine_clears.smc even with mfence + lfence after the store on Skylake. I was hoping that would stop speculation into the code in another page until after the store could evict the uop-cache and L1i entries. -- Peter Cordes

Zulan · Accepted Answer · 2018-09-16T17:31:31

Granted the L1 iCache and dCache are separate, isn't there a way to install code into the iCache so that the i-cache misses can be avoided?

No.

If you want to modify code, the only path the store can take is the following:

Store Data Execution Engine
Store Buffer & Forwarding
L1 Data Cache
Unified L2 Cache
L1 Instruction Cache

Note that you are also missing out on the μOP Cache.

This is illustrated by this diagram¹, which I believe is sufficiently accurate.

I would suspect the iTLB misses could be due to regular TLB flushes. In case of no modification you are not affected by iTLB misses because your instructions actually come from the μOP Cache.

If they don't, I'm not quite sure. I would think the L1 Instruction Cache is virtually addressed, so no need to access the TLB if there is a hit.

¹ Unfortunately the image has a very restrictive copyright, so I refrain from highlighting the path or inlining the image.
