3
votes

I am writing an application that makes heavy use of mmap, including from distinct processes (not concurrently, but serially). A big determinant of performance is how the TLB is managed on the user and kernel side for such mappings.

I understand reasonably well the user-visible aspects of the Linux page cache. I think this understanding extends to the userland performance impacts¹.

What I don't understand is how those same pages are mapped into kernel space, and how this interacts with the TLB (on x86-64). You can find lots of information on how this worked in the 32-bit x86 world², but I didn't dig up the answer for 64-bit.

So the two questions are (both interrelated and probably answered in one shot):

  1. How is the page cache mapped³ in kernel space on x86-64?
  2. If you read() N pages from a file in some process, then read exactly those same N pages from another process on the same CPU, is it possible that all the kernel-side accesses (during the kernel -> userspace copy of the contents) hit in the TLB? Note that this is (probably) a direct consequence of (1).

My overall goal here is to understand, at a deep level, the performance difference between one-off access to cached files via mmap versus non-mmap calls such as read().
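
For concreteness, here is a minimal sketch of the two access patterns I'm comparing (the file name, buffer handling and error handling are placeholders, not my actual code):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Path 1: plain read() -- the kernel copies page-cache pages into buf. */
    static void read_path(const char *path, char *buf, size_t len)
    {
        int fd = open(path, O_RDONLY);
        size_t off = 0;
        while (off < len) {
            ssize_t n = read(fd, buf + off, len - off);
            if (n <= 0)
                break;
            off += (size_t)n;
        }
        close(fd);
    }

    /* Path 2: mmap() -- the page-cache pages are mapped into the process and
     * copied out with a userspace memcpy (PTEs populated on first touch). */
    static void mmap_path(const char *path, char *buf, size_t len)
    {
        int fd = open(path, O_RDONLY);
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            memcpy(buf, p, len);
            munmap(p, len);
        }
        close(fd);
    }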


¹ For example, if you mmap a file into your process's virtual address space, you have effectively asked for your process page tables to contain a mapping from the returned/requested virtual address range to a physical range corresponding to the pages for that file in the page cache (even if they don't exist in the page cache, yet). If MAP_POPULATE is specified, all the page table entries will actually be populated before the mmap call returns, and if not they will be populated as you fault in the associated pages (sometimes with optimizations such as fault-around).
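
To make the distinction concrete (a sketch only; fd and len stand for an already-open file descriptor and the length to map):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    static void *map_file(int fd, size_t len, int populate)
    {
        int flags = MAP_SHARED | (populate ? MAP_POPULATE : 0);
        /* With MAP_POPULATE the page table entries for the whole range are
         * filled in (and the pages read ahead) before mmap() returns; without
         * it, they are filled in lazily as pages are faulted in, possibly
         * several at a time via fault-around. */
        void *p = mmap(NULL, len, PROT_READ, flags, fd, 0);
        return p == MAP_FAILED ? NULL : p;
    }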

² Basically (for the 3:1 split anyway), Linux directly maps approximately the first 1 GB of physical RAM into the top 1 GB of kernel virtual memory using large pages, which is the end of the story for machines with <= 1 GB of RAM: the page cache necessarily lives in that directly mapped region, so a handful of large-page TLB entries cover everything. With more than 1 GB of RAM, the page cache is preferentially allocated from "HIGHMEM", the region above ~1 GB which isn't covered by the kernel's direct mapping, so various temporary mapping strategies are used.

³ By mapped I mean: how are the page tables set up for accessing it, i.e., how does the virtual <-> physical mapping work?

Stack Overflow is a site for programming and development questions. This question appears to be off-topic because it is not about programming or development. See "What topics can I ask about here" in the Help Center. Perhaps Super User or Unix & Linux Stack Exchange would be a better place to ask. Also see "Where do I post questions about Dev Ops?" – jww
@jww - thanks for the note, but it's a programming and development question - not only is the answer itself about programming (how is the kernel programmed), it is actually directly related to programming I'm doing targeting the mmap function. I already read the help guide and it is explicitly on-topic here: software tools commonly used by programmers. Furthermore, there are a ton of great similar questions with great answers here (yes, I get that just because other off-topic questions exist doesn't mean you can ask more, but this isn't off topic, so don't worry). – BeeOnRope
Finally, if reasonable minds could disagree (and they can here) - just leave it be. I don't know if you've asked these types of development-oriented questions on SU or Unix SE, but you don't get good answers. Those communities are largely for expert users of such systems, not developers. ... and dev ops? I think you misunderstood the thrust of my question. – BeeOnRope
Did you ever solve this? I'm optimizing an application that wants to sequential-read a ~100MB file of big-endian float32 into its own large buffer before further processing. I'm trying to figure out whether it's better to mmap and copy+vpshufb on the fly, or whether to read() in 128k or 64k chunks and then modify in-place for that block that's still hot in L2. I don't have perf counters on the Haswell-EP VM this runs on, so it's annoying to experiment. Since mmap of files can't use hugepages, read() might beat mmap(MAP_POPULATE) if the kernel is memcpying from 1G or 2M pages. – Peter Cordes
On mmap specifically, I ended up just testing it. The one-off linear read should be the best possible case for read() and write(), but I still found mmap() faster, though only by 10%-40%. When reading large files (say 100MB+), I could get around 10 GB/s from mmap() versus a memcpy bandwidth of ~13 GB/s, so it is pretty close to the max already. For files that fit in L3, the gap was much larger. I put a lot of details on the tradeoffs I found here. @PeterCordes – BeeOnRope

1 Answer

1
votes

Due to the vast virtual address space compared to the physical RAM installed (128 TB for the kernel on x86-64), the common trick is to permanently map all of RAM into kernel space. This is known as the "direct map".

In principle it is possible that both the relevant TLB and cache entries survive the context switch and all the other code executed in between, but it is hard to say how likely that is in the real world.
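
If you want to check this empirically rather than guess, something along these lines can count dTLB load misses (including the kernel-side ones from the copy_to_user path) around a read() pass. This is only a rough sketch: event availability differs per CPU, and counting kernel events may require a permissive perf_event_paranoid setting.

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Open a counter for data-TLB read misses, user + kernel. */
    static int open_dtlb_miss_counter(void)
    {
        struct perf_event_attr attr = {0};
        attr.type = PERF_TYPE_HW_CACHE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_DTLB |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;
        attr.exclude_kernel = 0;   /* the kernel-side copy is what we care about */
        attr.exclude_hv = 1;
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;
        char buf[1 << 16];
        int fd = open(argv[1], O_RDONLY);
        int pfd = open_dtlb_miss_counter();

        ioctl(pfd, PERF_EVENT_IOC_RESET, 0);
        ioctl(pfd, PERF_EVENT_IOC_ENABLE, 0);
        while (read(fd, buf, sizeof(buf)) > 0)
            ;                       /* the pass being measured */
        ioctl(pfd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t misses = 0;
        read(pfd, &misses, sizeof(misses));
        printf("dTLB load misses: %llu\n", (unsigned long long)misses);
        return 0;
    }

Run it twice in a row (or from two different processes, as in question 2) and compare the miss counts for the second pass against the first.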