How to use Intel Westmere 1GB pages on Linux?

Question

Edit: I updated my question with the details of my benchmark

For benchmarking purposes, I am trying to set up 1GB pages on a Linux 3.13 system running on two Intel Xeon 56xx ("Westmere") processors. To do so, I modified my boot parameters to reserve ten 1GB pages. These boot parameters configure only 1GB pages, not 2MB ones. Running hugeadm --pool-list gives:

      Size  Minimum  Current  Maximum  Default
1073741824       10       10       10        *

My kernel boot parameters are taken into account. In my benchmark I am allocating 1GiB of memory that I want to be backed by a 1GiB huge page using:

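#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define PROTECTION (PROT_READ | PROT_WRITE)
#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)
uint64_t size = 1UL*1024*1024*1024;
/* Anonymous mapping: there is no backing file, so fd is conventionally -1. */
memory = mmap(0, size, PROTECTION, FLAGS, -1, 0);
if (memory == MAP_FAILED) {
    perror("mmap");
    exit(1);
}
sleep(200);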
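
The snippet above relies on the default huge page size being 1GiB. As a hedged alternative, the page size can be requested explicitly by encoding log2 of the size into the mmap flags (the MAP_HUGE_SHIFT mechanism, available since Linux 3.8). The helper names huge_size_flag and map_one_gib_huge_page are made up for this sketch, and the fallback value 26 for MAP_HUGE_SHIFT is an assumption for older headers that do not define it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26 /* assumed fallback for headers that lack it */
#endif

/* Encode log2(page size) into mmap flag bits: 30 for 1GiB, 21 for 2MiB. */
int huge_size_flag(unsigned log2_bytes)
{
    assert(log2_bytes < 32); /* keeps the shifted value inside an int */
    return (int)(log2_bytes << MAP_HUGE_SHIFT);
}

/* Try to map exactly one 1GiB huge page; returns MAP_FAILED on error. */
void *map_one_gib_huge_page(void)
{
    size_t size = (size_t)1 << 30;
    int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | huge_size_flag(30);
    return mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
}
```

The call fails with MAP_FAILED when no 1GiB page is reserved, so the caller should check the result and call perror as in the original snippet.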

Looking at /proc/meminfo while the bench is sleeping (the sleep call above), we can see that one huge page has been allocated:

AnonHugePages:      4096 kB
HugePages_Total:      10
HugePages_Free:        9
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB

Note: I disabled THP (through the /sys file system) before running the bench, so I guess the AnonHugePages field reported by /proc/meminfo represents the huge pages allocated by THP before stopping it.

At this point everything looks fine, but my bench leads me to think that many 2MiB pages are used rather than one 1GiB page. Here is the explanation:

This bench randomly accesses the allocated memory through pointer chasing: a first step fills the memory to enable pointer chasing (each cell points to another cell), and in a second step the bench navigates through the memory using

pointer = *pointer;
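
For readers who want to reproduce this access pattern, here is a minimal sketch of a pointer-chasing setup. It is not the code from mem_load.c: the names build_chain and chase, the xorshift generator, and the use of Sattolo's algorithm to link all cells into one random cycle are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift PRNG; the state must be nonzero. */
uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Link cells[0..n-1] into a single random cycle (Sattolo's algorithm). */
void build_chain(void **cells, size_t n, uint64_t seed)
{
    uint64_t state = seed ? seed : 0x9E3779B97F4A7C15ULL;
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        cells[i] = &cells[i];            /* start with every cell pointing to itself */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64(&state) % i); /* j in [0, i); slight modulo bias */
        void *tmp = cells[i];
        cells[i] = cells[j];
        cells[j] = tmp;
    }
}

/* Follow the chain for `steps` hops; returning the pointer keeps the loop alive. */
void **chase(void **start, size_t steps)
{
    void **p = start;
    while (steps--)
        p = (void **)*p;                 /* the "pointer = *pointer;" step */
    return p;
}
```

Because the cells form one cycle, chasing n hops from any cell visits every cell once before returning to the start, so no part of the arena is skipped.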

Using the perf_event_open system call, I count data TLB read misses for the second step of the bench only. When the allocated memory size is 64MiB, only a very small fraction, 0.01% of my 6400000 memory accesses, are data TLB read misses. Nearly all translations are served from the TLB; in other words, 64MiB of memory can be covered by the TLB. As soon as the allocated memory size exceeds 64MiB, I see data TLB read misses. For a memory size of 128MiB, 50% of my 6400000 memory accesses miss in the TLB. 64MiB appears to be the size that fits in the TLB, and 64MiB = 32 entries (as reported below) * 2MiB pages. I conclude that I am not using 1GiB pages but 2MiB ones.
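
The counting mentioned at the start of the paragraph above can be set up roughly as follows. This is a sketch using the generic PERF_TYPE_HW_CACHE event encoding, not necessarily the configuration the benchmark uses; the function names are hypothetical, and whether the generic event maps to the intended hardware counter on a given CPU is an assumption.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Encode "data TLB, read access, miss" as cache_id | (op << 8) | (result << 16). */
uint64_t dtlb_read_miss_config(void)
{
    return (uint64_t)PERF_COUNT_HW_CACHE_DTLB |
           ((uint64_t)PERF_COUNT_HW_CACHE_OP_READ << 8) |
           ((uint64_t)PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
}

/* Open a disabled user-space-only counter for the calling thread; -1 on error. */
int open_dtlb_read_miss_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof attr;
    attr.config = dtlb_read_miss_config();
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* pid = 0 (this thread), cpu = -1 (any CPU), no group, no flags */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Count events only while work(arg) runs; returns -1 if the read fails. */
long long count_dtlb_read_misses(int fd, void (*work)(void *), void *arg)
{
    long long count = 0;
    assert(fd >= 0);
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &count, sizeof count) != (ssize_t)sizeof count)
        return -1;
    return count;
}
```

Passing only the chase loop as work keeps the fill step out of the measurement, which matches counting misses "for the second step of the bench only".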

Can you see any explanation for that behavior?

Moreover, the cpuid tool reports the following about the TLB on my system:

   cache and TLB information (2):
      0x5a: data TLB: 2M/4M pages, 4-way, 32 entries
      0x03: data TLB: 4K pages, 4-way, 64 entries
      0x55: instruction TLB: 2M/4M pages, fully, 7 entries
      0xb0: instruction TLB: 4K, 4-way, 128 entries
      0xca: L2 TLB: 4K, 4-way, 512 entries
   L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
   L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
   L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
   L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):

As you can see, there is no information about 1GiB pages. How many such pages can be cached in the TLB?

@IwillnotexistIdonotexist github.com/ManuelSelva/c4fun/blob/master/mem_load/mem_load.c — Manuel Selva
The way I see it then, on Westmere, the only advantage to having 1GB pages, is to reduce the amount of memory needed for the page tables themselves. Specifically, Linux x86_64's direct mapping of all physical memory, and I suppose any userspace program crazy enough to mmap multiples of 1GB. — Jonathon Reinhart

Iwillnotexist Idonotexist · Accepted Answer · 2015-01-19T22:55:12

TL;DR

You (specifically, your processor) cannot benefit from 1GB pages in this scenario, but your code is correct without modifications on systems that can.

Long version

I followed these steps to attempt to reproduce your problem.

My System: Intel Core i7-4700MQ, 32GB RAM 1600MHz, Chipset H87

svn co https://github.com/ManuelSelva/c4fun.git
cd c4fun.git/trunk
make. Discovered a few dependencies were needed. Installed them. Build failed, but mem_load did build and link, so did not pursue the rest further.
Rebooted the system, appending at GRUB time to the boot arguments the following:
```
 hugepagesz=1G hugepages=10 default_hugepagesz=1G
```
which reserves 10 1GB pages.
cd c4fun.git/trunk/mem_load
Ran several tests using mem_load, in random-access pattern mode and pinned to core 3, chosen only because it is not core 0 (the bootstrap processor).
- ./mem_load -a rand -c 3 -m 1073741824 -i 1048576
  
  This resulted in approximately nil TLB misses.
- ./mem_load -a rand -c 3 -m 10737418240 -i 1048576
  
  This resulted in approximately 60% TLB misses. On a hunch I did
- ./mem_load -a rand -c 3 -m 4294967296 -i 1048576
  
  This resulted in approximately nil TLB misses. On a hunch I did
- ./mem_load -a rand -c 3 -m 5368709120 -i 1048576
  
  This resulted in approximately 20% TLB misses.

At this point I downloaded the cpuid utility. It gave me this for cpuid -1 | grep -i tlb:

   cache and TLB information (2):
      0x63: data TLB: 1G pages, 4-way, 4 entries
      0x03: data TLB: 4K pages, 4-way, 64 entries
      0x76: instruction TLB: 2M/4M pages, fully, 8 entries
      0xb5: instruction TLB: 4K, 8-way, 64 entries
      0xc1: L2 TLB: 4K/2M pages, 8-way, 1024 entries
   L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
   L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
   L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
   L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):

As you can see, my TLB has 4 entries for 1GB pages. This explains my results well: for 1GB and 4GB arenas, the 4 slots of the TLB are entirely sufficient to satisfy all accesses. For 5GB arenas in random-access pattern mode, only 4 of the 5 pages can be mapped through the TLB at any time, so chasing a pointer into the remaining one causes a miss. The probability of chasing a pointer into the unmapped page is 1/5, so we expect a miss rate of 1/5 = 20%, and that is what we get. For 10GB, 4/10 of the pages are mapped and 6/10 are not, so the expected miss rate is 6/10 = 60%, and we got that too.
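
The arithmetic in the paragraph above can be written as a small model. This is a sketch under stated assumptions: accesses are uniformly random over the arena, the TLB is in steady state with each of its entries holding a distinct arena page, and no second-level TLB serves that page size. The function name expected_miss_rate is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Expected TLB miss rate for uniform random accesses over an arena.
 * entries:    number of TLB entries for this page size
 * page_bytes: page size in bytes */
double expected_miss_rate(uint64_t arena_bytes, uint64_t entries,
                          uint64_t page_bytes)
{
    assert(page_bytes > 0);
    /* Number of pages needed to cover the arena, rounded up. */
    uint64_t pages = (arena_bytes + page_bytes - 1) / page_bytes;
    if (pages <= entries)
        return 0.0;                       /* the whole arena fits in the TLB */
    return 1.0 - (double)entries / (double)pages;
}
```

With the numbers from this thread, a 5GB arena with 4 entries of 1GB gives about 0.2 and a 10GB arena gives about 0.6. The same model applied to the question's Westmere figures (128MiB arena, 32 entries of 2MiB) gives 0.5, which is the miss rate the asker measured.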

So your code works without modifications on my system at least. Your code does not appear to be problematic then.

I then did some research on CPU-World, and while not all CPUs are listed with TLB geometry data, some are. The only one I saw that matched your cpuid printout exactly (there could be more) is the Xeon Westmere-EP X5650; CPU-World does not explicitly say that the Data TLB0 has entries for 1GB pages, but does say the processor has "1 GB large page support".
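
On Linux, this kind of "1 GB large page support" shows up as the pdpe1gb flag in /proc/cpuinfo. A minimal sketch for checking it follows; the helper names are hypothetical. Note that the flag only says the page tables can describe 1GB pages; it says nothing about whether the TLB has entries for them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-separated token in `line`. */
int has_cpu_flag(const char *line, const char *flag)
{
    size_t len = strlen(flag);
    assert(len > 0);
    for (const char *p = strstr(line, flag); p; p = strstr(p + 1, flag)) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts && ends)
            return 1;
    }
    return 0;
}

/* Return 1 if /proc/cpuinfo lists pdpe1gb, 0 if not, -1 if it cannot be read. */
int cpu_reports_pdpe1gb(void)
{
    char line[8192];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f)
        return -1;
    while (!found && fgets(line, sizeof line, f)) {
        if (strncmp(line, "flags", 5) == 0)   /* only look at the flags lines */
            found = has_cpu_flag(line, "pdpe1gb");
    }
    fclose(f);
    return found;
}
```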

I then did more research and finally nailed it. An author at RealWorldTech makes an off-hand comment (admittedly, I have yet to find a source for it) in the discussion of the memory subsystem of Sandy Bridge. It reads as follows:

After address generation, uops will access the DTLB to translate from a virtual to a physical address, in parallel with the start of the cache access. The DTLB was mostly kept the same, but the support for 1GB pages has improved. Previously, Westmere added support for 1GB pages, but fragmented 1GB pages into many 2MB pages since the TLB did not have any 1GB page entries. Sandy Bridge adds 4 dedicated entries for 1GB pages in the DTLB.

(Emphasis added)

Conclusion

Whatever nebulous concept "CPU supports 1GB pages" represents, Intel thinks it does not imply "TLB supports 1GB page entries". I'm afraid that you will not be able to use 1GB pages on an Intel Westmere processor to reduce the number of TLB misses.

That, or Intel is hoodwinking us by distinguishing huge pages (in the TLB) from large pages.
