
Finding the latencies of L1/L2/L3 caches is easy:

Approximate cost to access various caches and main memory?

but I am interested in what the cost is (in CPU cycles) of translating a virtual address to a physical address when:

  1. There is a hit in the L1 TLB
  2. There is a miss in the L1 TLB but a hit in the L2 TLB
  3. There is a miss in the L2 TLB and a hit in the page table
  4. (I don't think there can be a miss in the page table, can there? If there can, what is the cost of that?)

I did find this:

  • Data TLB L1 size = 64 items. 4-WAY. Miss penalty = 7 cycles. Parallel miss: 1 cycle per access
  • TLB L2 size = 512 items. 4-WAY. Miss penalty = 10 cycles. Parallel miss: 21 cycles per access
  • Instruction TLB L1 size = 64 items per thread (128 per core). 4-WAY
  • PDE cache = 32 items?

http://www.7-cpu.com/cpu/SandyBridge.html

but it doesn't mention the cost of a hit, i.e., the cost of accessing the relevant TLB itself.

Got to love a down-vote with no comment/justification... – user3811839
I suspect the down-vote came from a view that this is off-topic, being about hardware, and perhaps excessively broad, since it depends on the specific hardware and (in the case of page table misses) the specific operating system software. Some processors do not have L2 TLBs and use software TLB fill, in which case the delay for TLB fill would depend on the specific hardware and operating system. – Paul A. Clayton

1 Answer


Typically the L1 TLB access time will be less than the cache access time, to allow tag comparison in a set-associative, physically tagged cache. A direct-mapped cache can delay the tag check by assuming a hit. (For an in-order processor, a miss with immediate use of the data would simply wait for the miss to be handled, so there is no performance penalty from the delayed check. For an out-of-order processor, correcting such mis-speculation can have a noticeable performance impact. While an out-of-order processor is unlikely to use a direct-mapped cache, it may use way prediction, which can behave similarly.) A virtually tagged cache can (in theory) delay TLB access even further, since the TLB is then needed only to verify permissions, not to determine a cache hit, and the handling of permission violations is generally somewhat expensive and rare.

This means that L1 TLB access time will generally not be made public since it will not influence software performance tuning.
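
To make the parallel-lookup point concrete, here is a small worked example. The cache geometry below (32 KiB, 8-way, 64-byte lines) is an assumption chosen because it is typical for an L1 data cache, not a figure from the question: when the set-index and line-offset bits fit entirely within the 12-bit page offset, the cache can be indexed from the virtual address while the TLB translates the upper bits in parallel.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed L1D geometry -- typical, not vendor data. */
        unsigned size = 32 * 1024, ways = 8, line = 64;

        unsigned sets        = size / (ways * line);   /* 64 sets          */
        unsigned offset_bits = 6;                      /* log2(64 B line)  */
        unsigned index_bits  = 6;                      /* log2(64 sets)    */
        unsigned page_bits   = 12;                     /* log2(4 KiB page) */

        /* If index+offset <= page-offset bits, the set can be selected from
           untranslated bits; the TLB only has to deliver the physical tag
           in time for the tag comparison. */
        printf("%u sets; index+offset = %u bits; page offset = %u bits -> %s\n",
               sets, index_bits + offset_bits, page_bits,
               index_bits + offset_bits <= page_bits
                   ? "TLB lookup can overlap the cache access"
                   : "translation needed before indexing");
        return 0;
    }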

The L2 TLB hit time is equivalent to the L1 TLB miss penalty. This varies with the specific implementation and may not be a single value: for example, if the TLB uses banking to support multiple accesses in a single cycle, bank conflicts can delay accesses, and if rehashing is used to support multiple page sizes, a page of the alternate size takes longer to find (in both cases the delay can accumulate under high utilization).
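
One way to estimate these penalties empirically is to touch one cache line in each of N distinct 4 KiB pages in a loop: with N = 32 the working set fits the 64-entry L1 DTLB, with N = 256 it spills into the 512-entry L2 TLB, and with N = 1024 it misses both. A minimal sketch (the page counts match the Sandy Bridge sizes quoted above; the timing method is a simplification and the results mix in some cache effects):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define PAGE 4096
    #define TOTAL 50000000UL                 /* total accesses per run */

    /* Touch one line in each of n pages, round-robin. The line offset is
       varied per page so accesses spread over cache sets and the measured
       effect is (mostly) the TLB rather than L1D conflict misses. */
    static double ns_per_access(volatile char *buf, size_t n)
    {
        unsigned long reps = TOTAL / n;
        struct timespec a, b;

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (unsigned long r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                buf[i * PAGE + (i * 64) % PAGE] += 1;
        clock_gettime(CLOCK_MONOTONIC, &b);

        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        return ns / ((double)reps * n);
    }

    int main(void)
    {
        size_t pages = 1024;
        char *buf = malloc(pages * PAGE);
        if (!buf) return 1;
        for (size_t i = 0; i < pages * PAGE; i += PAGE)
            buf[i] = 1;                      /* fault every page in up front */

        printf("  32 pages (L1 DTLB hits):  %.2f ns/access\n", ns_per_access(buf, 32));
        printf(" 256 pages (L2 TLB hits):   %.2f ns/access\n", ns_per_access(buf, 256));
        printf("1024 pages (L2 TLB misses): %.2f ns/access\n", ns_per_access(buf, 1024));
        free(buf);
        return 0;
    }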

The time required for an L2 TLB fill can vary greatly. ARM and x86 use hardware TLB fill with a multi-level page table. Depending on where page table data can be cached and whether there is a cache hit, the latency of a TLB fill can range from one cache-hit latency per level of the page table (when every entry of the walk is found in cache) up to one main-memory access per level, plus some overhead.
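
As a rough illustration of that range (the cycle counts below are assumptions picked for the arithmetic, not measured values): every level of the walk is a dependent load, so the per-level latencies simply add.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed latencies in cycles -- illustrative only. */
        int levels = 4;                      /* x86-64 four-level page table */
        int l1_hit = 4;
        int dram   = 200;

        /* Best case: every page-table access hits the L1 data cache.
           Worst case: every access goes to main memory. */
        printf("best case:  ~%d cycles\n", levels * l1_hit);
        printf("worst case: ~%d cycles\n", levels * dram);
        return 0;
    }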

Complicating this further, more recent Intel x86 processors have paging-structure caches which allow levels of the page table to be skipped. E.g., if a page directory entry (an entry in a second-level page table which points to a page of page table entries) is found in this cache, then rather than starting from the base of the page table and doing four dependent look-ups, only a single look-up is required.
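
For reference, this is how a four-level x86-64 walk decomposes a virtual address for 4 KiB pages (the field widths are architectural; the example address is arbitrary). A hit in the PDE cache means only the last of the four dependent loads remains:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t va = 0x00007f1234567abcULL;  /* arbitrary example address */

        /* x86-64 four-level paging, 4 KiB pages: 9 index bits per level. */
        unsigned pml4 = (va >> 39) & 0x1ff;   /* level 4: PML4 entry       */
        unsigned pdpt = (va >> 30) & 0x1ff;   /* level 3: page dir pointer */
        unsigned pd   = (va >> 21) & 0x1ff;   /* level 2: page directory   */
        unsigned pt   = (va >> 12) & 0x1ff;   /* level 1: page table       */
        unsigned off  =  va        & 0xfff;   /* byte offset within page   */

        /* A full walk is four dependent loads: PML4E -> PDPTE -> PDE -> PTE.
           With the PDE cached, only the PTE load is needed. */
        printf("pml4=%u pdpt=%u pd=%u pt=%u offset=0x%x\n",
               pml4, pdpt, pd, pt, off);
        return 0;
    }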

(It might be worth noting that using a page the size of the virtual address region covered by a level of the page table (e.g., 2 MiB or 1 GiB for x86-64) reduces the depth of the page table hierarchy. Not only can such large pages reduce TLB pressure, they can also reduce the latency of a TLB miss.)
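
On Linux, one way to request such a page explicitly is mmap with MAP_HUGETLB (a sketch, assuming 2 MiB huge pages have been reserved by the administrator; transparent huge pages via madvise(MADV_HUGEPAGE) are an alternative):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 2 * 1024 * 1024;        /* one 2 MiB page */

        /* One TLB entry then covers the whole 2 MiB region, and a TLB miss
           walks one fewer page-table level. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");     /* e.g., no huge pages reserved */
            return 1;
        }
        ((char *)p)[0] = 1;                  /* touch the page */
        munmap(p, len);
        return 0;
    }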

A page table miss is handled by the operating system. The page might still be in memory (e.g., if the write to swap has not yet completed), in which case the latency will be relatively small. (The actual latency will depend on how the operating system implements this and on cache hit behavior, though cache misses for both the code and the data are likely, since paging is an uncommon event.) If the page is no longer in memory, the latency of reading from secondary storage (e.g., a disk drive) is added to the software latency of handling the invalid page table entry (i.e., the page table miss).
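
On Linux, the two cases can be distinguished after the fact by reading the process's fault counters: getrusage reports minor faults (the page was still in memory) separately from major faults (I/O was required). A minimal sketch:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rusage before, after;
        size_t len = 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;

        getrusage(RUSAGE_SELF, &before);
        p[0] = 1;                            /* first touch: takes a fault */
        getrusage(RUSAGE_SELF, &after);

        /* ru_minflt: page was in memory; ru_majflt: I/O was required. */
        printf("minor faults: %ld, major faults: %ld\n",
               after.ru_minflt - before.ru_minflt,
               after.ru_majflt - before.ru_majflt);
        munmap(p, len);
        return 0;
    }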