1
votes

Given the separation between the virtual addresses that processes manipulate and the physical addresses that represent actual locations in memory, you can play some interesting tricks, such as creating a circular buffer with no discontinuity at the beginning/end of the allocated space.
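For concreteness, here is a minimal Linux-specific sketch of the trick (assuming memfd_create(2) is available; the alias_ring name is illustrative and error handling is elided): the same physical pages are mapped twice, back to back, so an access that runs off the end of the first copy lands back at the start of the same buffer.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    /* Back a 2*len virtual region with the same len bytes of physical memory,
     * mapped twice. len must be a multiple of the page size. */
    static void *alias_ring(size_t len)
    {
        int fd = memfd_create("ring", 0);   /* anonymous file to map twice */
        ftruncate(fd, len);

        /* Reserve 2*len of contiguous virtual address space... */
        unsigned char *base = mmap(NULL, 2 * len, PROT_NONE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* ...then map the same file (the same physical pages) into both halves. */
        mmap(base,       len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
        mmap(base + len, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);

        close(fd);
        return base;   /* base[i] and base[i + len] are the same physical byte */
    }

With this, reads and writes of up to len contiguous bytes can start at any offset without ever special-casing the wrap-around point.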

I would like to know whether such mapping tricks carry a penalty for data read or write access in either of these cases:

  • Access to the physical page is mostly through the same virtual mapping, and only occasionally through the other mapping(s).
  • Access to the physical page(s) is spread more or less evenly across the virtual addresses that map to the same physical address.

I'm interested especially in x86 chips released over the last decade or so, but also in contemporary ARM and POWER chips.

2
You're only concerned with data accesses, not code-fetch, right? I think Intel's optimization manual mentions having multiple mappings for the same physical pages, but I forget what I've read. I think there are only penalties when one of the mappings is write-back but another is not (e.g. USWC). Oh, and store/load overlap detection may speculatively assume that pages are only mapped once, so that's worth checking on. – Peter Cordes
@PeterCordes yeah, I was thinking of data accesses, although I guess there are cool tricks to be played on the code side of things too :) – BeeOnRope
IIRC, the uop cache is virtually addressed, so multiple mappings for code might be less efficient. And JITing to one page and then executing it via an alias might not be great either; I forget. – Peter Cordes

2 Answers

4
votes

For 80x86 (I don't know about other architectures):

a) the normal instruction/data/unified caches behave as physically indexed (on x86 the L1 index bits typically fall within the page offset, and the caches are physically tagged) and are therefore unaffected by paging tricks

b) TLBs are virtually indexed. This means that (depending on a lot of things) you might expect a lot more TLB misses with your circular buffer trick than you would have seen without it. Things that could matter include:

  • the size of the area and the number and type of TLB entries used (4 KiB, 2 MiB/1 GiB);
  • whether the CPU prefetches TLB entries (recent CPUs do) and whether enough time is spent doing other work to ensure the prefetched entries arrive before they're needed; and
  • whether the CPU caches higher-level paging structures (e.g. page directories) to avoid fetching every level on a TLB miss (e.g. fetching the page table entry alone because the page directory was cached, vs. fetching the PML4 entry, then the PDPT entry, then the PD entry, then the page table entry).
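As a concrete illustration of the extra TLB pressure (using the hypothetical alias_ring helper sketched in the question), touching the same physical page through both mappings can occupy two TLB entries where an ordinary buffer would use one:

    #include <stddef.h>

    void *alias_ring(size_t len);   /* the hypothetical helper from the question */

    void touch_both_aliases(size_t len)   /* len is a multiple of the page size */
    {
        unsigned char *buf = alias_ring(len);
        buf[0]   = 1;   /* physical page 0 via the first mapping's virtual page         */
        buf[len] = 2;   /* the same physical page via the second mapping's virtual page */
    }

So in the asker's second case, where accesses are spread evenly across both mappings, the effective TLB footprint of the buffer roughly doubles.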

c) Any uop cache (e.g. as part of a loop stream detector, or the old Pentium 4 "trace cache") is virtually indexed or not indexed at all (e.g. the CPU just remembers "uops from start of loop"). This won't matter unless you have multiple copies of the code; and if you do have multiple copies it becomes complicated (e.g. duplication may cause the number of uops to exceed the size of the uop cache).

d) Branch prediction is virtually indexed. This means that if you have multiple copies of the same code it becomes complicated again (e.g. it would increase the "training time" for branches that aren't statically predicted correctly; and duplication can cause the number of branches to exceed the number of branch prediction slots, resulting in worse branch prediction).

e) The return stack buffer is virtually indexed, but I can't think of how that could matter (duplicating code wouldn't increase the depth of the call graph).

f) For store buffers (used for store forwarding): if a load is on a different virtual page than an earlier store, the CPU has to assume they may be aliased regardless of whether they actually are; so aliasing shouldn't make things any worse here.

g) For write-combining buffers: I'm honestly not sure whether they're virtually indexed or physically indexed. Chances are that if it could matter, you'd run out of "write combining slots" before it actually does.

0
votes

If you're looking for possible penalties, I would start from the store-to-load forwarding logic. If you have a store to virtual address A and a later load from virtual address B, and both map to the same physical address, you're going to confuse the hell out of your CPU.

The main issue is that these conflicts must be resolved as early as possible for loads to be fast (which is what most microarchitectures are optimized for), so some designs match the virtual addresses (or parts of them) when they're known in time (if they're not, you fall back on memory disambiguation, but that's a different story). Keep in mind that L1 set selection usually uses only the lower 12 address bits, which allows the L1 lookup and the TLB lookup to be performed in parallel - this wouldn't be possible if you also had to wait for a full match against every earlier store in the system. Luckily, if you alias virtual pages the way you want to, you still get the same lower 12 bits thanks to the minimal 4k granularity, so this partial match would still work.

However, to be functionally foolproof, there must be a full physical match later on (once you have the full translation for the load and for all older stores), so at that point the design may decide whether it wants to forward the data based on the partial match (and risk having to flush everything), or wait for the full match. Either way will probably incur some delay, but I don't think aliasing would contribute to it unless you break the earlier partial check.
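To make that concrete, here is a hypothetical microbenchmark kernel (assuming a double-mapped buffer where buf[i] and buf[i + len] are the same physical byte, as in the circular-buffer trick from the question; the function name is illustrative): the store and the aliased load share their low 12 bits, so the partial match fires, but forwarding can only be confirmed once both translations are available.

    #include <stddef.h>
    #include <stdint.h>

    /* Store through the first mapping, then load the same physical byte back
     * through the second mapping. Timing this against a version that loads
     * buf[off] directly would expose any penalty for aliased forwarding. */
    uint64_t aliased_forwarding(volatile uint8_t *buf, size_t len, size_t n)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            size_t off = i % len;
            buf[off] = (uint8_t)i;   /* store via the first mapping              */
            sum += buf[off + len];   /* load via the second mapping: same low 12
                                        bits, different virtual page, same
                                        physical byte                            */
        }
        return sum;
    }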