1
votes

Given the separation between the virtual addresses that processes manipulate and the physical addresses that represent actual locations in memory, you can play some interesting tricks, such as creating a circular buffer with no discontinuity at the beginning/end of the allocated space.
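For concreteness, here is a minimal Linux-specific sketch of the trick (assuming memfd_create(2) is available; the alias_ring name is illustrative and error handling is elided): the same physical pages are mapped twice, back to back, so an access that runs off the end of the first copy lands back at the start of the same buffer.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    /* Back a 2*len virtual region with the same len bytes of physical memory,
     * mapped twice. len must be a multiple of the page size. */
    static void *alias_ring(size_t len)
    {
        int fd = memfd_create("ring", 0);   /* anonymous file to map twice */
        ftruncate(fd, len);

        /* Reserve 2*len of contiguous virtual address space... */
        unsigned char *base = mmap(NULL, 2 * len, PROT_NONE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* ...then map the same file (the same physical pages) into both halves. */
        mmap(base,       len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
        mmap(base + len, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);

        close(fd);
        return base;   /* base[i] and base[i + len] are the same physical byte */
    }

With this, reads and writes of up to len contiguous bytes can start at any offset without ever special-casing the wrap-around point.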

I would like to know whether such mapping tricks carry a penalty for data read or write access in either of these cases:

  • Access to the physical page is mostly through the same virtual mapping, and only occasionally through the other mapping(s).
  • Access to the physical page(s) is spread more or less evenly across the virtual addresses that map to the same physical address.

I'm interested especially in x86 chips released over the last decade or so, but also in contemporary ARM and POWER chips.

2
You're only concerned with data accesses, not code-fetch, right? I think Intel's optimization manual mentions having multiple mappings for the same physical pages, but I forget what I've read. I think there are only penalties when one of the mappings is write-back but another is not (e.g. USWC). Oh, and store/load overlap detection may speculatively assume that pages are only mapped once, so that's worth checking on. – Peter Cordes
@PeterCordes yeah, I was thinking of data accesses, although I guess there are cool tricks to be played on the code side of things too :) – BeeOnRope
IIRC, the uop cache is virtually addressed, so multiple mappings for code might be less efficient. And JITing to one page and then executing it via an alias might not be great either; I forget. – Peter Cordes

2 Answers

4
votes

For 80x86 (I don't know about other architectures):

a) the normal instruction/data/unified caches behave as physically indexed (on x86 the L1 index bits typically fall within the page offset, and the caches are physically tagged) and are therefore unaffected by paging tricks

b) TLBs are virtually indexed. This means that (depending on a lot of things) you might expect a lot more TLB misses with your circular buffer trick than you would have seen without it. Things that could matter include:

  • the size of the area and the number and type of TLB entries used (4 KiB, 2 MiB/1 GiB);
  • whether the CPU prefetches TLB entries (recent CPUs do) and whether enough time is spent doing other work to ensure the prefetched entries arrive before they're needed; and
  • whether the CPU caches higher-level paging structures (e.g. page directories) to avoid fetching every level on a TLB miss (e.g. fetching the page table entry alone because the page directory was cached, vs. fetching the PML4 entry, then the PDPT entry, then the PD entry, then the page table entry).
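As a concrete illustration of the extra TLB pressure (using the hypothetical alias_ring helper sketched in the question), touching the same physical page through both mappings can occupy two TLB entries where an ordinary buffer would use one:

    #include <stddef.h>

    void *alias_ring(size_t len);   /* the hypothetical helper from the question */

    void touch_both_aliases(size_t len)   /* len is a multiple of the page size */
    {
        unsigned char *buf = alias_ring(len);
        buf[0]   = 1;   /* physical page 0 via the first mapping's virtual page         */
        buf[len] = 2;   /* the same physical page via the second mapping's virtual page */
    }

So in the asker's second case, where accesses are spread evenly across both mappings, the effective TLB footprint of the buffer roughly doubles.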

c) Any uop cache (e.g. as part of a loop stream detector, or the old Pentium 4 "trace cache") is virtually indexed or not indexed at all (e.g. the CPU just remembers "uops from start of loop"). This won't matter unless you have multiple copies of the code; and if you do have multiple copies it becomes complicated (e.g. duplication may cause the number of uops to exceed the size of the uop cache).

d) Branch prediction is virtually indexed. This means that if you have multiple copies of the same code it becomes complicated again (e.g. it would increase the "training time" for branches that aren't statically predicted correctly; and duplication can cause the number of branches to exceed the number of branch prediction slots, resulting in worse branch prediction).

e) The return stack buffer is virtually indexed, but I can't think of how that could matter (duplicating code wouldn't increase the depth of the call graph).

f) For store buffers (used for store forwarding): if a load is on a different virtual page than an earlier store, the CPU has to assume they may be aliased regardless of whether they actually are; so aliasing shouldn't make things any worse here.

g) For write-combining buffers: I'm honestly not sure whether they're virtually indexed or physically indexed. Chances are that if it could matter, you'd run out of "write combining slots" before it actually does.

0
votes

If you're looking for possible penalties, I would start from the store-to-load forwarding logic. If you have a store to virtual address A and a later load from virtual address B, and both map to the same physical address, you're going to confuse the hell out of your CPU.

The main issue is that these conflicts must be resolved as early as possible for loads to be fast (which is what most microarchitectures are optimized for), so some designs match the virtual addresses (or parts of them) when they're known in time (if they're not, you fall back on memory disambiguation, but that's a different story). Keep in mind that L1 set selection usually uses only the lower 12 address bits, which allows the L1 lookup and the TLB lookup to be performed in parallel - this wouldn't be possible if you also had to wait for a full match against every earlier store in the system. Luckily, if you alias virtual pages the way you want to, you still get the same lower 12 bits thanks to the minimal 4k granularity, so this partial match would still work.

However, to be functionally foolproof, there must be a full physical match later on (once you have the full translation for the load and for all older stores), so at that point the design may decide whether it wants to forward the data based on the partial match (and risk having to flush everything), or wait for the full match. Either way will probably incur some delay, but I don't think aliasing would contribute to it unless you break the earlier partial check.
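To make that concrete, here is a hypothetical microbenchmark kernel (assuming a double-mapped buffer where buf[i] and buf[i + len] are the same physical byte, as in the circular-buffer trick from the question; the function name is illustrative): the store and the aliased load share their low 12 bits, so the partial match fires, but forwarding can only be confirmed once both translations are available.

    #include <stddef.h>
    #include <stdint.h>

    /* Store through the first mapping, then load the same physical byte back
     * through the second mapping. Timing this against a version that loads
     * buf[off] directly would expose any penalty for aliased forwarding. */
    uint64_t aliased_forwarding(volatile uint8_t *buf, size_t len, size_t n)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            size_t off = i % len;
            buf[off] = (uint8_t)i;   /* store via the first mapping              */
            sum += buf[off + len];   /* load via the second mapping: same low 12
                                        bits, different virtual page, same
                                        physical byte                            */
        }
        return sum;
    }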