I am writing an application that makes heavy use of `mmap`, including from distinct processes (not concurrently, but serially). A big determinant of performance is how the TLB is managed on the user and kernel side for such mappings.
I understand reasonably well the user-visible aspects of the Linux page cache. I think this understanding extends to the userland performance impacts¹.
What I don't understand is how those same pages are mapped into kernel space, and how this interacts with the TLB (on x86-64). You can find lots of information on how this worked in the 32-bit x86 world², but I didn't dig up the answer for 64-bit.
So the two questions are (both interrelated and probably answered in one shot):
- How is the page cache mapped³ in kernel space on x86-64?
- If you `read()` N pages from a file in one process, then later read exactly the same N pages from another process on the same CPU, is it possible that all the kernel-side reads (during the kernel -> userspace copy of the contents) hit in the TLB? Note that this is (probably) a direct consequence of (1). A sketch of the access pattern I mean follows below.
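To be concrete about the second question, here is a minimal sketch of the access pattern I have in mind (the file name and the 16-page count are arbitrary placeholders, and pinning both readers to the same CPU, e.g. with `sched_setaffinity`, is omitted):

```c
/* Two processes, strictly serialized, read() the same N pages of the same
 * file; the question is whether the second process's kernel-side accesses
 * to the page-cache pages can still hit in the TLB. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void read_n_pages(const char *path, size_t n_pages)
{
    long page = sysconf(_SC_PAGESIZE);
    char *buf = malloc(n_pages * page);
    int fd = open(path, O_RDONLY);
    if (fd < 0 || !buf) { perror("open/malloc"); exit(1); }
    /* The kernel copies from the page cache into buf; those kernel-side
     * loads go through whatever mapping the kernel uses for the cache. */
    if (read(fd, buf, n_pages * page) < 0)
        perror("read");
    close(fd);
    free(buf);
}

int main(void)
{
    const char *path = "testfile";   /* placeholder: assumed already cached */
    pid_t child = fork();
    if (child == 0) {                /* first reader */
        read_n_pages(path, 16);
        _exit(0);
    }
    waitpid(child, NULL, 0);         /* serialize: second reader starts after */
    read_n_pages(path, 16);          /* same pages, second process */
    return 0;
}
```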
My overall goal here is to understand, at a deep level, the performance difference of one-off access of cached files via `mmap` or via non-`mmap` calls such as `read()`.
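To show the kind of comparison I have in mind, here is a rough sketch (not a careful benchmark): the same cached file is consumed once via `read()` and once via `mmap()`; `testfile` is a placeholder and error handling is minimal.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* One big read(): the kernel walks the page cache and memcpy's into buf,
 * so the copy's loads happen through the kernel's mapping of those pages. */
static uint64_t sum_read(int fd, size_t len)
{
    char *buf = malloc(len);
    uint64_t sum = 0;
    ssize_t got = pread(fd, buf, len, 0);
    for (ssize_t i = 0; i < got; i++)
        sum += (unsigned char)buf[i];
    free(buf);
    return sum;
}

/* mmap(): the process page tables point at the page-cache pages directly,
 * so the cost is our own loads plus user-side faults and TLB misses. */
static uint64_t sum_mmap(int fd, size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    uint64_t sum = 0;
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }
    for (size_t i = 0; i < len; i++)
        sum += p[i];
    munmap(p, len);
    return sum;
}

int main(void)
{
    int fd = open("testfile", O_RDONLY);   /* placeholder path */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror("open/fstat"); return 1; }
    printf("read sum: %llu  mmap sum: %llu\n",
           (unsigned long long)sum_read(fd, st.st_size),
           (unsigned long long)sum_mmap(fd, st.st_size));
    close(fd);
    return 0;
}
```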
¹ For example, if you `mmap` a file into your process's virtual address space, you have effectively asked for your process page tables to contain a mapping from the returned/requested virtual address range to a physical range corresponding to the pages for that file in the page cache (even if they don't exist in the page cache yet). If `MAP_POPULATE` is specified, all the page table entries will actually be populated before the `mmap` call returns, and if not, they will be populated as you fault in the associated pages (sometimes with optimizations such as fault-around).
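That footnote in code form, as a small sketch (the `map_file` helper and its `prefault` flag are names I made up for illustration):

```c
#define _GNU_SOURCE            /* for MAP_POPULATE on glibc */
#include <stddef.h>
#include <sys/mman.h>

/* Map a whole file read-only, either lazily or with MAP_POPULATE. */
void *map_file(int fd, size_t len, int prefault)
{
    int flags = MAP_PRIVATE | (prefault ? MAP_POPULATE : 0);
    /* With MAP_POPULATE, the kernel installs all the PTEs (reading pages
     * into the page cache as needed) before mmap() returns; without it,
     * each PTE is filled on first touch via a minor fault (possibly with
     * fault-around filling a few neighbors at once). */
    return mmap(NULL, len, PROT_READ, flags, fd, 0);
}
```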
² Basically, (for 3:1 mappings anyway) Linux uses a single 1 GB page to map approximately the first 1 GB of physical RAM directly (and places it at the top 1 GB of virtual memory), which is the end of the story for machines with <= 1 GB of RAM (the page cache necessarily goes in that 1 GB mapping, and hence a single 1 GB TLB entry covers everything). With more than 1 GB of RAM, the page cache is preferentially allocated from "HIGHMEM", the region above 1 GB which isn't covered by the kernel's 1 GB mapping, so various temporary mapping strategies are used.
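The lowmem part of that scheme is just a constant offset, which I can write out as arithmetic (a sketch with the conventional 32-bit, 3:1-split value hard-coded rather than read from a running kernel; as far as I understand, x86-64 keeps the same linear "direct map" idea, just with a much larger base high in the 64-bit address space):

```c
#include <stdint.h>
#include <stdio.h>

/* 3:1 split: the kernel lives in the top 1 GB of the 32-bit address space,
 * and lowmem is mapped there linearly, so kernel virtual = physical + offset
 * (this is what the kernel's __va()/__pa() helpers compute for lowmem). */
#define PAGE_OFFSET 0xC0000000UL

static uintptr_t lowmem_virt(uintptr_t phys) { return phys + PAGE_OFFSET; }
static uintptr_t lowmem_phys(uintptr_t virt) { return virt - PAGE_OFFSET; }

int main(void)
{
    uintptr_t phys = 0x01234000;   /* some page-cache page in lowmem */
    printf("phys 0x%08lx -> kernel virt 0x%08lx\n",
           (unsigned long)phys, (unsigned long)lowmem_virt(phys));
    printf("and back: 0x%08lx\n",
           (unsigned long)lowmem_phys(lowmem_virt(phys)));
    return 0;
}
```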
³ By mapped I mean how the page tables are set up for its access, i.e., how the virtual <-> physical mapping works.
This question is about the `mmap` function. I already read the help guide and it is explicitly on-topic here: "software tools commonly used by programmers". Furthermore, there are a ton of great similar questions with great answers here (yes, I get that just because other off-topic questions exist doesn't mean you can ask more, but this isn't off topic, so don't worry). – BeeOnRope

You could `read()` in 128k or 64k chunks and then modify in place for the block that's still hot in L2. I don't have perf counters on the Haswell-EP VM this runs on, so it's annoying to experiment. Since mmap of files can't use hugepages, `read()` might beat `mmap(MAP_POPULATE)` if the kernel is memcpying from 1G or 2M pages. – Peter Cordes
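A sketch of the chunked-read pattern mentioned in that comment (the 64 KiB chunk size and the trivial in-place byte increment are placeholders):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (64 * 1024)

/* Pull one cache-sized chunk at a time with read() and transform it while
 * it is still hot in L2, instead of mapping the whole file at once. */
int process_file(const char *path)
{
    static char buf[CHUNK];
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }
    ssize_t got;
    while ((got = read(fd, buf, CHUNK)) > 0) {
        for (ssize_t i = 0; i < got; i++)
            buf[i] += 1;             /* modify in place while hot in L2 */
        /* ... hand the transformed chunk to the next stage here ... */
    }
    close(fd);
    return 0;
}

int main(void)
{
    return process_file("testfile") ? 1 : 0;   /* placeholder path */
}
```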
Regarding `mmap` specifically, I ended up just testing it. The one-off linear read should be the best possible case for `read()` and `write()`, but I still found `mmap()` faster, though only by say 10%-40%. When reading large files (say 100 MB+), I could get around 10 GB/s from `mmap()` versus a `memcpy` bandwidth of ~13 GB/s, so it is pretty close to the max already. For files that fit in L3, the gap was much larger. I put a lot of details on the tradeoffs I found here. @PeterCordes – BeeOnRope