How exactly do kernel virtual addresses get translated to physical RAM?

Question

On the surface, this appears to be a silly question. Some patience, please. :-) I am structuring this question into two parts:

Part 1: I fully understand that platform RAM is mapped into the kernel segment; esp on 64-bit systems this will work well. So each kernel virtual address is indeed just an offset from physical memory (DRAM).

Also, it's my understanding that as Linux is a modern virtual memory OS, (pretty much) all addresses are treated as virtual addresses and must "go" via hardware - the TLB/MMU - at runtime and then get translated by the TLB/MMU via kernel paging tables. Again, easy to understand for user-mode processes.

HOWEVER, what about kernel virtual addresses? For efficiency, would it not be simpler to direct-map these (and an identity mapping is indeed set up from PAGE_OFFSET onwards)? But still, at runtime, the kernel virtual address must go via the TLB/MMU and get translated, right? Is this actually the case? Or is kernel virtual address translation just an offset calculation? (But how can that be, as we must go via the hardware TLB/MMU?) As a simple example, let's consider:

char *kptr = kmalloc(1024, GFP_KERNEL);

Now kptr is a kernel virtual address. I understand that virt_to_phys() can perform the offset calculation and return the physical DRAM address. But, here's the Actual Question: it can't be done in this manner via software - that would be pathetically slow! So, back to my earlier point: it would have to be translated via hardware (TLB/MMU). Is this actually the case??

Part 2: Okay, let's say this is the case, and we do use paging in the kernel to do this. We must of course set up kernel paging tables; I understand they are rooted at swapper_pg_dir.

(I also understand that vmalloc() unlike kmalloc() is a special case- it's a pure virtual region that gets backed by physical frames only on page fault).

If (in Part 1) we do conclude that kernel virtual address translation is done via kernel paging tables, then how exactly does the kernel paging table (swapper_pg_dir) get "attached" or "mapped" to a user-mode process?? This should happen in the context-switch code? How? Where?

E.g. on an x86_64, 2 processes A and B are alive, 1 CPU. A is running, so its higher-canonical addresses 0xFFFF8000 00000000 through 0xFFFFFFFF FFFFFFFF "map" to the kernel segment, and its lower-canonical addresses 0x0 through 0x00007FFF FFFFFFFF map to its private userspace.

Now, if we context-switch A->B, process B's lower-canonical region is unique, but it must "map" to the same kernel of course! How exactly does this happen? How do we "auto" refer to the kernel paging table when in kernel mode? Or is that a wrong statement?

Thanks for your patience, would really appreciate a well thought out answer!

Stuart Menefy · Accepted Answer · 2016-04-26T17:53:08

First a bit of background.

This is an area where there is a lot of potential variation between architectures, however the original poster has indicated he is mainly interested in x86 and ARM, which share several characteristics:

no hardware segments or similar partitioning of the virtual address space (when used by Linux)
hardware page table walk
multiple page sizes
physically tagged caches (at least on modern ARMs)

So if we restrict ourselves to those systems it keeps things simpler.

Once the MMU is enabled, it is never normally turned off. So all CPU addresses are virtual, and will be translated to physical addresses using the MMU. The MMU will first look up the virtual address in the TLB, and only if it doesn't find it in the TLB will it refer to the page table - the TLB is a cache of the page table - and so we can ignore the TLB for this discussion.
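To make the "TLB is a cache of the page table" point concrete, here is a minimal user-space sketch in C. It models a toy single-level page table and a tiny direct-mapped TLB; every name in it (struct mmu, mmu_translate, the sizes) is made up for illustration and does not correspond to kernel code or to any real MMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                      /* assumed 4 KiB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)
#define NUM_PAGES  16u                      /* toy virtual address space */
#define TLB_SLOTS  4u                       /* toy direct-mapped TLB */

/* Toy single-level page table entry, indexed by virtual page number. */
struct pte { bool valid; uint32_t frame; };

/* One cached translation. */
struct tlb_entry { bool valid; uint32_t vpn; uint32_t frame; };

struct mmu {
    struct pte page_table[NUM_PAGES];
    struct tlb_entry tlb[TLB_SLOTS];
    unsigned tlb_hits;      /* translations served from the TLB */
    unsigned table_walks;   /* translations that had to read the page table */
};

/* Translate vaddr to *paddr; returns false if there is no valid mapping. */
bool mmu_translate(struct mmu *m, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    struct tlb_entry *slot;

    assert(m && paddr);
    if (vpn >= NUM_PAGES)
        return false;

    slot = &m->tlb[vpn % TLB_SLOTS];
    if (slot->valid && slot->vpn == vpn) {
        m->tlb_hits++;                      /* fast path: cached translation */
    } else {
        m->table_walks++;                   /* slow path: consult the page table */
        if (!m->page_table[vpn].valid)
            return false;                   /* would be a page fault */
        slot->valid = true;                 /* refill the TLB slot */
        slot->vpn = vpn;
        slot->frame = m->page_table[vpn].frame;
    }
    *paddr = (slot->frame << PAGE_SHIFT) | (vaddr & PAGE_MASK);
    return true;
}
```

In this model the first access to a page costs a table walk and later accesses to the same page hit the TLB, yet the resulting physical address is identical either way, which is why the TLB can be ignored when reasoning about what an address maps to.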

The page table describes the entire virtual 32- or 64-bit address space, and includes information like:

whether the virtual address is valid
which mode(s) the processor must be in for it to be valid
special attributes for things like memory mapped hardware registers
and the physical address to use
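As a rough illustration of the items listed above, a page table entry can be thought of as a frame number plus a few flag bits. The layout below is hypothetical (the PTE_* names and bit positions are invented for this sketch); real entry formats are architecture specific and carry more attributes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag layout, for illustration only. */
#define PTE_VALID       (1u << 0)   /* the virtual address is mapped */
#define PTE_USER        (1u << 1)   /* accessible in user mode, not just kernel mode */
#define PTE_DEVICE      (1u << 2)   /* special attributes, e.g. memory-mapped registers */
#define PTE_FRAME_SHIFT 12u         /* physical frame number lives in the upper bits */

typedef uint32_t pte_t;

/* Build an entry from a physical frame number and flag bits. */
pte_t pte_make(uint32_t frame, uint32_t flags)
{
    assert(frame < (1u << (32u - PTE_FRAME_SHIFT)));
    assert(flags < (1u << PTE_FRAME_SHIFT));
    return (frame << PTE_FRAME_SHIFT) | flags;
}

/* Would an access be allowed in the given processor mode? */
bool pte_access_ok(pte_t pte, bool cpu_in_user_mode)
{
    if (!(pte & PTE_VALID))
        return false;               /* no mapping: fault in any mode */
    if (cpu_in_user_mode && !(pte & PTE_USER))
        return false;               /* kernel-only page touched from user mode */
    return true;
}

/* Extract the physical frame number to use. */
uint32_t pte_frame(pte_t pte)
{
    return pte >> PTE_FRAME_SHIFT;
}
```

With this model, an entry created as pte_make(frame, PTE_VALID) behaves like a kernel page: it translates fine in kernel mode and faults in user mode.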

Linux divides the virtual address space into two: the lower portion is used for user processes, and there is a different virtual-to-physical mapping for each process. The upper portion is used for the kernel, and the mapping is the same even when switching between different user processes. This keeps things simple, as an address is unambiguously in user or kernel space, the page table doesn't need to be changed when entering or leaving the kernel, and the kernel can simply dereference pointers into user space for the current user process. Typically on 32-bit processors the split is 3G user/1G kernel, although this can vary. Pages for the kernel portion of the address space will be marked as accessible only when the processor is in kernel mode, to prevent them being accessible to user processes. The portion of the kernel address space which is linearly mapped to RAM at a fixed offset (kernel logical addresses; often loosely called "identity mapped") will be mapped using big pages when possible, which may allow the page table to be smaller but more importantly reduces the number of TLB misses.
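For kernel logical addresses, the relationship between virtual and physical addresses is a constant offset. The sketch below shows that arithmetic under assumed values for a 32-bit 3G/1G split; TOY_PAGE_OFFSET, TOY_PHYS_BASE and the toy_* helpers are hypothetical stand-ins, not the kernel's definitions. This is the kind of calculation a helper such as virt_to_phys() performs when kernel code needs a physical address as a value; ordinary loads and stores do not call it, because the MMU translates them using page table entries that encode the same offset.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for a 32-bit 3G/1G split; both are hypothetical. */
#define TOY_PAGE_OFFSET 0xC0000000u   /* start of the kernel's linear mapping */
#define TOY_PHYS_BASE   0x00000000u   /* physical address where RAM starts */

/* Physical address -> kernel logical (virtual) address. */
uint32_t toy_phys_to_virt(uint32_t paddr)
{
    return paddr - TOY_PHYS_BASE + TOY_PAGE_OFFSET;
}

/* Kernel logical (virtual) address -> physical address. */
uint32_t toy_virt_to_phys(uint32_t vaddr)
{
    assert(vaddr >= TOY_PAGE_OFFSET);   /* only valid inside the linear mapping */
    return vaddr - TOY_PAGE_OFFSET + TOY_PHYS_BASE;
}
```

Under these assumptions, the kernel virtual address 0xC0100000 corresponds to physical address 0x00100000.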

When the kernel starts it creates a single page table for itself (swapper_pg_dir) which describes just the kernel portion of the virtual address space, with no mappings for the user portion. Then every time a user process is created a new page table will be generated for that process; the portion which describes kernel memory will be the same in each of these page tables. This could be done by copying all of the relevant portion of swapper_pg_dir, but because page tables are normally tree structures, the kernel is frequently able to graft the portion of the tree which describes the kernel address space from swapper_pg_dir into the page tables for each user process by just copying a few entries in the upper layer of the page table structure. As well as being more efficient in memory (and possibly cache) usage, it makes it easier to keep the mappings consistent. This is one of the reasons why the split between kernel and user virtual address spaces can only occur at certain addresses.

To see how this is done for a particular architecture look at the implementation of pgd_alloc(). For example ARM (arch/arm/mm/pgd.c) uses:

pgd_t *pgd_alloc(struct mm_struct *mm)
{
    ...
    init_pgd = pgd_offset_k(0);
    memcpy(new_pgd + USER_PTRS_PER_PGD, init_pgd + USER_PTRS_PER_PGD,
               (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
    ...
}

or x86 (arch/x86/mm/pgtable.c) pgd_alloc() calls pgd_ctor():

static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
{
    /* If the pgd points to a shared pagetable level (either the
       ptes in non-PAE, or shared PMD in PAE), then just copy the
       references from swapper_pg_dir. */
        ...
        clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
                swapper_pg_dir + KERNEL_PGD_BOUNDARY,
                KERNEL_PGD_PTRS);
    ...
}
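The grafting idea in the two excerpts above can be modelled in user space. The following is a simplified sketch, not the kernel's implementation: the toy_* names and table sizes are invented, and a top-level entry is modelled as a plain pointer to a second-level table.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PTRS_PER_PGD      8   /* top-level entries (toy size) */
#define TOY_USER_PTRS_PER_PGD 6   /* entries 0..5: user, 6..7: kernel */

/* A second-level table; its contents do not matter for this sketch. */
struct toy_pt { unsigned long entries[4]; };

/* A top-level entry is modelled as a pointer to a second-level table. */
typedef struct toy_pt *toy_pgd_t;

/* Stand-in for swapper_pg_dir: only the kernel entries are ever populated. */
toy_pgd_t toy_swapper_pg_dir[TOY_PTRS_PER_PGD];

/* Install a kernel second-level table in the master table. */
void toy_kernel_install(unsigned idx, struct toy_pt *pt)
{
    assert(idx >= TOY_USER_PTRS_PER_PGD && idx < TOY_PTRS_PER_PGD);
    toy_swapper_pg_dir[idx] = pt;
}

/* Allocate a top-level table for a new process: the user part starts empty,
 * the kernel part is grafted by copying the top-level entries only. */
toy_pgd_t *toy_pgd_alloc(void)
{
    toy_pgd_t *new_pgd = calloc(TOY_PTRS_PER_PGD, sizeof(toy_pgd_t));

    if (!new_pgd)
        return NULL;
    memcpy(new_pgd + TOY_USER_PTRS_PER_PGD,
           toy_swapper_pg_dir + TOY_USER_PTRS_PER_PGD,
           (TOY_PTRS_PER_PGD - TOY_USER_PTRS_PER_PGD) * sizeof(toy_pgd_t));
    return new_pgd;
}
```

Because only pointers are copied, every process's top-level table refers to the same second-level kernel tables, so a change made inside one of those shared tables is visible through all of them without touching each process.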

So, back to the original questions:

Part 1: Are kernel virtual addresses really translated by the TLB/MMU?

Yes.

Part 2: How is swapper_pg_dir "attached" to a user mode process?

All page tables (whether swapper_pg_dir or those for user processes) have the same mappings for the portion used for kernel virtual addresses. So as the kernel context switches between user processes, changing the current page table, the mappings for the kernel portion of the address space remain the same.
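A small simulation may help show why nothing special has to happen for kernel addresses at a context switch. In this sketch (all toy_* names are hypothetical, and the two-level layout is deliberately tiny), switching processes only changes which root table is current; the walk itself is identical for user and kernel addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_TOP_ENTRIES  4u    /* entries 0-1: user half, 2-3: kernel half */
#define TOY_LEAF_ENTRIES 4u    /* pages per second-level table */
#define TOY_PAGE_SHIFT   12u   /* assumed 4 KiB pages */

struct toy_leaf {
    bool valid[TOY_LEAF_ENTRIES];
    uint32_t frame[TOY_LEAF_ENTRIES];
};

struct toy_root {
    struct toy_leaf *top[TOY_TOP_ENTRIES];
};

/* Stand-in for the hardware register that holds the current table root. */
static struct toy_root *toy_current_root;

/* The only page-table work done at a context switch in this model. */
void toy_switch_mm(struct toy_root *next)
{
    assert(next);
    toy_current_root = next;
}

/* Walk the current tables; the same code path serves user and kernel addresses. */
bool toy_translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> TOY_PAGE_SHIFT;
    uint32_t top_idx = vpn / TOY_LEAF_ENTRIES;
    uint32_t leaf_idx = vpn % TOY_LEAF_ENTRIES;
    struct toy_leaf *leaf;

    if (!toy_current_root || top_idx >= TOY_TOP_ENTRIES)
        return false;
    leaf = toy_current_root->top[top_idx];
    if (!leaf || !leaf->valid[leaf_idx])
        return false;
    *paddr = (leaf->frame[leaf_idx] << TOY_PAGE_SHIFT)
           | (vaddr & ((1u << TOY_PAGE_SHIFT) - 1u));
    return true;
}
```

If two roots have different leaves in their user half but point at the same leaf in their kernel half, then after toy_switch_mm() a user address translates to a different frame while a kernel address (0x8000 and above in this toy layout) translates exactly as before.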
