3
votes

So I'm working on a kernel module that does some page table manipulation, and I noticed that flushing a TLB entry is slow. How slow, you ask? Over 100 ns per call to invlpg! That's 280 cycles or more. I'm willing to accept this... but for hardware-supported paging and address translation it seems counterintuitive. Does anyone know why this is so bad?

I'm running on a 4-core 2.8 GHz Intel Core i5.
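
For reference, here's a rough sketch of the kind of measurement I'm describing: a minimal kernel module that times a single invlpg with rdtsc/rdtscp, with interrupts disabled. It's illustrative only, not my exact code, and the function names are made up for the example.

```c
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/irqflags.h>

static int dummy; /* any mapped kernel address works as a flush target */

static inline u64 tsc_start(void)
{
        unsigned int lo, hi;
        /* LFENCE keeps earlier instructions from drifting past the read */
        asm volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
        return ((u64)hi << 32) | lo;
}

static inline u64 tsc_end(void)
{
        unsigned int lo, hi, aux;
        /* RDTSCP waits for prior instructions (including INVLPG) to finish */
        asm volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
        return ((u64)hi << 32) | lo;
}

static int __init invlpg_timer_init(void)
{
        unsigned long flags;
        u64 start, end;

        local_irq_save(flags);
        start = tsc_start();
        asm volatile("invlpg (%0)" :: "r"(&dummy) : "memory");
        end = tsc_end();
        local_irq_restore(flags);

        pr_info("invlpg took ~%llu cycles\n", end - start);
        return 0;
}

static void __exit invlpg_timer_exit(void) { }

module_init(invlpg_timer_init);
module_exit(invlpg_timer_exit);
MODULE_LICENSE("GPL");
```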

1
On early Pentiums (1993) it was faster: intel-assembler.it/portale/5/intel-pentium-instruction/… : "INVLPG ... 25 Clock Cycles". It is probably microcoded: lkml.org/lkml/2008/1/25/607 — osgx
Also, it is a serializing instruction (sandpile.org/x86/coherent.htm), so it blocks the entire pipeline (20+ stages) and all the reordering hardware. This means all earlier instructions must execute and have their results stored (nothing left in flight in the store buffers), and none of the following instructions can be scheduled. You can compare its speed with the serializing CPUID instruction. There have also been theories that invlpg needs to walk the TLB entry by entry, or that there is a lot of special handling for large pages (superpages of 2–4 MB and sometimes 1 GB). — osgx
Timoteo: May I ask how you performed those measurements? @osgx: I guess halting the pipeline is essential when changing any address translation? Do you have references for those theories? — Daniel Jour

1 Answer

2
votes

My guess is that privileged instructions like this are rarely a significant part of the total CPU time of any real workload, so it's not worth spending the amount of silicon it would take to make them faster.

Making them non-serializing would mean the out-of-order uop scheduling logic would have to track page-table modifications as one of the dependencies of every memory uop. That would have negative consequences for power consumption, since the re-order buffer already has to track a lot of state and support four inputs and even more outputs per cycle.
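
To get a feel for what serialization alone costs, here's a rough user-space sketch (illustrative only; invlpg itself needs ring 0, so this times CPUID, the serializing instruction mentioned in the comments, against a trivially pipelined add loop):

```c
#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
        enum { ITERS = 1000000 };
        uint64_t t0, t1;
        volatile uint64_t sink = 0;

        /* baseline: plain additions, fully pipelined */
        t0 = rdtsc();
        for (int i = 0; i < ITERS; i++)
                sink += i;
        t1 = rdtsc();
        printf("add loop:   %.1f cycles/iter\n", (double)(t1 - t0) / ITERS);

        /* serializing: CPUID drains the pipeline every iteration */
        t0 = rdtsc();
        for (int i = 0; i < ITERS; i++) {
                uint32_t a = 0, b, c, d;
                asm volatile("cpuid"
                             : "+a"(a), "=b"(b), "=c"(c), "=d"(d));
        }
        t1 = rdtsc();
        printf("cpuid loop: %.1f cycles/iter\n", (double)(t1 - t0) / ITERS);

        (void)sink;
        return 0;
}
```

On typical recent cores the add loop comes out near 1 cycle per iteration while the CPUID loop lands in the low hundreds, i.e. the same order of magnitude as the invlpg number in the question, which is consistent with most of the cost being the pipeline drain rather than the TLB work itself.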

The widespread use of virtualization has driven performance improvements in virtualization-related instructions in recent designs, since virt overhead is an issue in some workloads. I guess the same pressure hasn't applied to invlpg.