Two different user processes have different virtual address spaces. Because the virtual↔physical address mappings are different, the TLB cache is invalidated when switching contexts from one user process to another. This is very expensive: without the translation already cached in the TLB, every memory access misses and forces a walk of the page tables.
Syscalls involve two context switches: user→kernel, and then kernel→user. To speed this up, it is common to reserve the top 1GB or 2GB of the virtual address space for kernel use. Because the virtual address space does not change across these context switches, no TLB flushes are necessary. This is enabled by a user/supervisor bit in each PTE, which ensures that kernel memory is only accessible while executing in kernel mode; userspace has no access even though the page table is the same.
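To make that permission check concrete, here is a minimal sketch in C. The bit positions are the real x86 PTE flag bits; the `access_allowed` helper itself is hypothetical, since the check is actually performed in hardware by the MMU on every access:

```c
#include <stdint.h>
#include <stdbool.h>

/* x86 page-table entry flag bits (low bits of each 64-bit PTE). */
#define PTE_PRESENT  (1ull << 0)  /* mapping is valid                          */
#define PTE_WRITABLE (1ull << 1)  /* writable, not just readable               */
#define PTE_USER     (1ull << 2)  /* user/supervisor: 1 = userspace may access */

/* Hypothetical helper mirroring what the MMU does. `user_mode` is true when
 * the CPU is executing ring-3 (userspace) code. Kernel mappings simply leave
 * PTE_USER clear, so the same page table serves both modes and no TLB flush
 * is needed on a syscall. */
static bool access_allowed(uint64_t pte, bool user_mode) {
    if (!(pte & PTE_PRESENT))
        return false;              /* not mapped at all */
    if (user_mode && !(pte & PTE_USER))
        return false;              /* kernel-only page: user access faults */
    return true;
}
```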
If there were hardware support for two separate TLBs, with one exclusively for kernel use, then this optimization would no longer be useful. However, if you have enough space to dedicate, it's probably more worthwhile to just make one larger TLB.
Linux on x86 once supported a mode known as "4G/4G split". In this mode, userspace has full access to the entire 4GB virtual address space, and the kernel also has a full 4GB virtual address space. The cost, as mentioned above, is that every syscall requires a TLB flush, along with more complex routines to copy data between user and kernel memory. This has been measured to impose up to a 30% performance penalty.
Times have changed since this question was originally asked and answered: 64-bit operating systems are now much more prevalent. In current OSes on x86-64, virtual addresses from 0 to 2^47-1 (0-128TB) are allowed for user programs while the kernel permanently resides within virtual addresses from 2^47×(2^17-1) to 2^64-1 (or from -2^47 to -1, if you treat addresses as signed integers).
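A quick arithmetic check of those boundaries, as a small self-contained C program (the numbers follow directly from the 48-bit canonical-address scheme):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Top of the user range: 2^47 - 1 = 0x00007FFFFFFFFFFF (~128TB). */
    uint64_t user_top = (1ull << 47) - 1;

    /* Bottom of the kernel range: 2^47 * (2^17 - 1) = 2^64 - 2^47,
     * i.e. 0xFFFF800000000000, which is -2^47 as a signed integer. */
    uint64_t kernel_start = (1ull << 47) * ((1ull << 17) - 1);

    printf("user:   0x%016llx .. 0x%016llx\n",
           0ull, (unsigned long long)user_top);
    printf("kernel: 0x%016llx .. 0x%016llx\n",
           (unsigned long long)kernel_start, ~0ull);
    printf("kernel start as signed: %lld (== -2^47)\n",
           (long long)kernel_start);
    return 0;
}
```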
What happens if you run a 32-bit executable on 64-bit Windows? You would think that all virtual addresses from 0 to 2^32-1 (0-4GB) would easily be available, but in order to avoid exposing bugs in existing programs, 32-bit executables are still limited to 0-2GB unless they are recompiled with `/LARGEADDRESSAWARE`. Those that are get access to 0-4GB. (This is not a new flag; the same applied to 32-bit Windows kernels running with the `/3GB` switch, which changed the default 2G/2G user/kernel split to 3G/1G, although of course 3-4GB would still be out of range.)
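If you want to check whether a given binary carries the flag, the bit lives in the COFF `Characteristics` field of the PE header (`IMAGE_FILE_LARGE_ADDRESS_AWARE`, 0x0020). Here is a rough C sketch that reads it directly, without the Windows SDK; the offsets are the standard PE layout, and error handling is kept minimal:

```c
#include <stdio.h>
#include <stdint.h>

/* PE "Characteristics" flag set by the /LARGEADDRESSAWARE linker switch. */
#define IMAGE_FILE_LARGE_ADDRESS_AWARE 0x0020

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s file.exe\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    uint8_t buf[4];
    fseek(f, 0x3C, SEEK_SET);   /* e_lfanew: file offset of the "PE\0\0" signature */
    fread(buf, 1, 4, f);
    uint32_t pe_off = buf[0] | buf[1] << 8
                    | (uint32_t)buf[2] << 16 | (uint32_t)buf[3] << 24;

    /* COFF Characteristics: 2 bytes at signature + 4 (sig) + 18 (field offset). */
    fseek(f, pe_off + 4 + 18, SEEK_SET);
    fread(buf, 1, 2, f);
    uint16_t characteristics = buf[0] | buf[1] << 8;
    fclose(f);

    printf("%s large-address-aware\n",
           (characteristics & IMAGE_FILE_LARGE_ADDRESS_AWARE) ? "IS" : "is NOT");
    return 0;
}
```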
What sorts of bugs might there be? As an example, suppose you are implementing quicksort and have two pointers, `a` and `b`, pointing at the start and just past the end of an array. If you choose the middle as the pivot with `(a+b)/2`, it'll work as long as both addresses are below 2GB, but if they are both above, then the addition will encounter integer overflow and the result will be outside the array. (The correct expression is `a+(b-a)/2`.)
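To see the failure concretely, here is a small self-contained C demo using 32-bit unsigned integers in place of real pointers (the specific addresses are made up for illustration):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Two 32-bit "addresses" above the 2GB line, standing in for pointers
     * into an array in a /LARGEADDRESSAWARE process. */
    uint32_t a = 0x90000000u;   /* start of the array            */
    uint32_t b = 0x90001000u;   /* one past the end (4KB later)  */

    /* Buggy midpoint: a + b = 0x120001000 wraps to 0x20001000 in 32 bits,
     * so halving gives an address nowhere near the array. */
    uint32_t bad_mid  = (a + b) / 2;

    /* Correct midpoint: the difference b - a always fits, so no overflow. */
    uint32_t good_mid = a + (b - a) / 2;

    printf("buggy   : 0x%08x\n", bad_mid);   /* 0x10000800 -- wrong           */
    printf("correct : 0x%08x\n", good_mid);  /* 0x90000800 -- inside the array */
    return 0;
}
```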
As an aside, 32-bit Linux, with its default 3G/1G user/kernel split, has historically run programs with their stack located in the 2-3GB range, so any such programming errors would likely have been flushed out quickly. 64-bit Linux gives 32-bit programs access to 0-4GB.