2 votes

Firstly, when the IFU issues a request for 16 bytes, is the interaction with the L1I fixed such that, when the L1I receives an address from the IFU, it produces 16 bytes in succession, or does the IFU have to send the addresses of all 16 bytes as with a traditional cache access?

To get to the point, assume the IFU is fetching instructions at 16B-aligned boundaries and suddenly the virtual index (and I'm assuming the virtual index is indeed logical virtual and not linear virtual -- not entirely sure; I know with L1D the AGU handles the segmentation offsets) misses in the L1i cache.

What would happen exactly? (Note: example CPU Skylake with a ring bus topology)

Would the front end be shut down once the decoders finish decoding whatever came before it, and how would this be done? Secondly, what sort of negotiation / conversation is there between the IFU and the L1I cache? On a miss, must the cache inform the IFU so that it stops fetching instructions? Perhaps the cache waits to receive the data from lower down and, as soon as it does, issues the data to the IFU, or does the IFU wait in a spin-lock state and keep attempting the read?

Let's assume that the data it wants is on a DDR4 module and not in the cache subsystem at all -- possible if an erratic program is causing difficulties for the hardware prefetchers. I'd like to get the process clear in my mind.

  • L1I cache miss, ITLB hit.
  • L1I cache controller allocates a line fill buffer
  • L1I cache controller requests the line from the L2 and passes the physical address to it (I'd imagine these operations don't clash with the hardware prefetchers' operations because all cache accesses must be serialized or queued)
  • L2 miss, passes address to LLC slice
  • LLC slice miss
  • Caching agent sends address to the home agent
  • Home agent detects no cores with the data
  • Home agent sends the address to the memory controller
  • Memory controller converts the address to a (channel, dimm, rank, IC, chip, bank group, bank, row, column) tuple and does the relevant mapping, interleaving, command generation etc. (see the sketch after this list for a toy version of this mapping step)
  • Now, since it's DDR4 it's going to return 128 bytes, but for simplification assume it's DDR3, so 64 bytes. The 64 bytes are sent back to the home agent; I assume this is all kept in queue order, so the home agent knows what address the data corresponds to.
  • The home agent sends the data to the caching agent; again, I assume the caching agent keeps some backlog of misses so it knows the data needs to be sent higher
  • The data is passed to L2; I don't know how L2 knows it needs to go higher, but there you go
  • The L2 controller passes the data to L1, and L1 somehow knows which line fill buffer to enter the requested cache line into and that it requires an F tag (forwarding).
  • The IFU either picks it up in its spin-lock state, or some negotiation takes place between the cache and the IFU
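To make the memory-controller step above a little more concrete, here is a minimal C sketch of the kind of address slicing a controller does. The field widths and bit positions are invented purely for illustration; real controllers use programmable, often XOR-hashed mappings rather than a fixed layout like this.

```c
/* Hypothetical address-mapping sketch: slice a physical address into
 * (channel, rank, bank group, bank, row, column). Bit positions are
 * made up for illustration only. */
#include <stdio.h>
#include <stdint.h>

struct dram_coord {
    unsigned channel, rank, bank_group, bank, row, column;
};

/* Extract 'bits' bits of 'addr' starting at bit position 'lo'. */
static unsigned field(uint64_t addr, unsigned lo, unsigned bits)
{
    return (unsigned)((addr >> lo) & ((1u << bits) - 1));
}

static struct dram_coord decode(uint64_t paddr)
{
    struct dram_coord c;
    /* Illustrative layout only; bits 0..5 are the offset within a 64B line. */
    c.column     = field(paddr, 6, 7);   /* 128 column positions  */
    c.channel    = field(paddr, 13, 1);  /* 2 channels            */
    c.bank       = field(paddr, 14, 2);  /* 4 banks per group     */
    c.bank_group = field(paddr, 16, 2);  /* 4 bank groups         */
    c.rank       = field(paddr, 18, 1);  /* 2 ranks               */
    c.row        = field(paddr, 19, 16); /* 64K rows              */
    return c;
}

int main(void)
{
    uint64_t paddr = 0x12345680;  /* example physical address */
    struct dram_coord c = decode(paddr);
    printf("ch=%u rank=%u bg=%u bank=%u row=%u col=%u\n",
           c.channel, c.rank, c.bank_group, c.bank, c.row, c.column);
    return 0;
}
```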

If anyone has some more information on this process and can enlighten me further, please let me know.

All split / unaligned load handling is done inside the L1d cache. But I think instruction fetch on Intel CPUs is done in aligned 16-byte blocks. There are queues between later stages that allow grouping into unaligned chunks, though, so maybe only L1d has to deal with unaligned 16-byte / 32-byte loads. Outer caches will only see requests for whole lines, so the ring-bus interconnect between cores doesn't matter at all. The DRAM interface is also irrelevant. (I guess you could run code from an uncacheable memory region, but you're asking about caches.) – Peter Cordes
Intel's L1 caches are VIPT: the tags are physical, not virtual. realworldtech.com/sandy-bridge/7 and realworldtech.com/haswell-cpu/6. The uop-cache is virtually addressed, though. See Agner Fog's microarch guide (agner.org/optimize), and other links in the x86 tag wiki (stackoverflow.com/tags/x86/info). – Peter Cordes
seg:off -> linear virtual translation happens before anything else. For data loads, it costs an extra cycle of latency if the segment base is non-zero. I assume instruction-cache loads are similar: the input to uop-cache checks and L1i + L1iTLB is a linear virtual address. – Peter Cordes
@PeterCordes Sorry, meant virtual index. I'll correct. – Lewis Kelsey

1 Answer

2 votes

Interesting question once you get past some of the misconceptions (see my comments on the question).

Fetch/decode happens strictly in program order. There's no mechanism for decoding a block from a later cache line while waiting on an L1i miss, not even to populate the uop-cache. My understanding is that the uop-cache is only ever populated with instructions the CPU expects to actually execute along the current path of execution.

(x86's variable-length instructions mean that you need to know an instruction boundary before you can even start decoding. This could be possible if branch-prediction says the cache-miss instruction block will branch somewhere in another cache line, but current hardware isn't built that way. There's nowhere to put the decoded instructions where the CPU could come back and fill in the gap.)
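To make the boundary problem concrete, here's a toy model in C (this is not x86's actual encoding; the "low two bits give the length" rule is invented): with variable-length instructions, instruction starts can only be discovered by walking forward from a known boundary, which is why a later, not-yet-fetched block can't usefully be decoded out of order.

```c
/* Toy variable-length ISA (NOT x86): the low 2 bits of each instruction's
 * first byte give its total length (1..4 bytes). Even with this trivial
 * encoding, instruction boundaries can only be found by scanning forward
 * from a known starting point. */
#include <stdio.h>
#include <stddef.h>

static size_t insn_length(unsigned char first_byte)
{
    return (first_byte & 0x3) + 1;   /* 1 to 4 bytes */
}

int main(void)
{
    unsigned char fetch_block[16] = {
        0x02, 0xAA, 0xBB,            /* 3-byte instruction */
        0x00,                        /* 1-byte instruction */
        0x03, 0x11, 0x22, 0x33,      /* 4-byte instruction */
        0x01, 0x44,                  /* 2-byte instruction */
        0x02, 0x55, 0x66,            /* 3-byte instruction */
        0x00, 0x00, 0x00             /* more 1-byte instructions */
    };

    /* Pre-decode: mark instruction-start offsets within the 16-byte block. */
    size_t offset = 0;               /* must be a known boundary to start */
    while (offset < sizeof fetch_block) {
        printf("instruction at offset %zu, length %zu\n",
               offset, insn_length(fetch_block[offset]));
        offset += insn_length(fetch_block[offset]);
    }
    return 0;
}
```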


There's hardware prefetching into L1i (which I assume does take advantage of branch prediction to know where to branch next even if the current fetch is blocked on a cache miss), so code-fetch can generate multiple outstanding loads in parallel to keep the memory pipeline better occupied.

But yes, an L1i miss creates a bubble in the pipeline which lasts until data arrives from L2. Every core has its own private per-core L2 which takes care of sending requests off-core if it misses in L2. WikiChip shows the data path between L2 and L1i is 64 bytes wide in Skylake-SP.

https://www.realworldtech.com/haswell-cpu/6/ shows L2<->L1d is 64 bytes wide in Haswell and later, but doesn't show as much detail for instruction-fetch. (Which is often not a bottleneck, especially for small to medium-sized loops that hit in the uop cache.)

There are queues between fetch, pre-decode (instruction boundaries) and full decode, which can hide / absorb these bubbles and sometimes stop them from reaching the decoders and actually hurting decode throughput. And there's a larger queue (64 uops on Skylake) that feeds the issue/rename stage, called the IDQ. Instructions are added to the IDQ from the uop cache or from legacy decode. (Or when a microcode-indirect uop for an instruction that takes more than 4 uops reaches the front of the IDQ, issue/rename fetches directly from the microcode sequencer ROM, for instructions like rep movsb or lock cmpxchg.)

But when a stage has no input data, yes it powers down. There's no "spin-lock"; it's not managing exclusive access to a shared resource, it's simply waiting based on a flow-control signal.

This also happens when code fetch hits in the uop cache: the legacy decoders can power down as well. Power saving is one of the benefits of the uop cache, just as the loopback buffer (LSD) in turn saves power for the uop cache by letting it idle while a small loop replays from the IDQ.
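As a rough software analogy (my own simplification, not Intel's actual logic; the queue size and bubble length are invented), a bounded queue between two stages lets the consumer keep draining previously queued work during a short fetch bubble, and once the queue runs dry the consumer simply goes idle rather than spinning on the cache:

```c
/* Simplified model of a decoupling queue (IDQ-like) between fetch and issue:
 * the producer stalls for several cycles (an L1i miss), but the consumer
 * keeps draining queued entries and only idles once the queue is empty. */
#include <stdio.h>

#define QUEUE_SIZE 8

int main(void)
{
    int queue = 0;                       /* entries currently in the queue */
    int consumed = 0, idle_cycles = 0;

    for (int cycle = 0; cycle < 20; cycle++) {
        /* Producer: up to 2 entries per cycle (fetch/decode is wider than
         * issue here), except during a bubble modeling an L1i miss. */
        int bubble = (cycle >= 5 && cycle < 13);
        if (!bubble)
            for (int i = 0; i < 2 && queue < QUEUE_SIZE; i++)
                queue++;

        /* Consumer: takes 1 entry per cycle if one is available; otherwise
         * it has nothing to do this cycle and can clock-gate. */
        if (queue > 0) {
            queue--;
            consumed++;
        } else {
            idle_cycles++;
        }
        printf("cycle %2d: %s queue=%d consumed=%d idle=%d\n",
               cycle, bubble ? "bubble" : "fetch ", queue, consumed, idle_cycles);
    }
    return 0;
}
```

Running it shows the queued entries hiding most of the 8-cycle bubble, with only a few idle cycles reaching the consumer.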


L1I cache controller allocates a line fill buffer

L2->L1i uses different buffers than the 10 LFBs that L1d cache / NT stores use. Those 10 are dedicated to the connection between L1d and L2.

The Skylake-SP block diagram on WikiChip shows a 64-byte data path from L2 to L1i, separate from the L2->L1d with its 10 LFBs.

L2 has to manage having multiple readers and writers (the L1 caches, and data to/from L3 via its SuperQueue buffers). @HadiBrais comments that we know the L2 can handle 2 hits per clock cycle, but the number of misses per cycle it can handle, or L3 requests it can generate, is less clear.

Hadi also commented: The L2 has one read 64-byte port for the L1i and one bidirectional 64-byte port with the L1d. It also has a read/write port (64-byte in Skylake, 32-byte in Haswell) with the L3 slice it is connected to. When the L2 controller receives a line from the L3, it immediately writes it into the corresponding superqueue entry (or entries).

I haven't checked a primary source for this, but it sounds right to me.


Fetch from DRAM happens with burst transfers of 64 bytes (1 cache line) at once. Not just 16 bytes (128 bits)! It's possible to execute code from an "uncacheable" memory region, but normally you're using WB (write-back) memory regions that are cacheable.

AFAIK, even DDR4 has a 64-byte burst size, not 128 bytes.
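As a quick arithmetic check (standard DDR3/DDR4 burst parameters, nothing Skylake-specific):

```c
/* DDR3 and DDR4 both use a burst length of 8 transfers on a 64-bit
 * (8-byte) per-channel data bus, so one burst moves one 64-byte line. */
#include <stdio.h>

int main(void)
{
    int bus_width_bytes = 64 / 8;   /* 64-bit DIMM data bus = 8 bytes/transfer */
    int burst_length = 8;           /* BL8: 8 transfers per burst              */
    printf("bytes per burst = %d\n", bus_width_bytes * burst_length); /* 64 */
    return 0;
}
```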

I assume this is all kept in queue order, so the home agent knows what address the data corresponds to.

No, the memory controller can reorder requests for locality within a DRAM page (not the same thing as a virtual-memory page).
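A minimal sketch of the kind of reordering meant here, in the spirit of a first-ready, first-come-first-served (FR-FCFS) policy; the request list, bank count, and row numbers are invented, and real schedulers weigh many more constraints:

```c
/* Simplified FR-FCFS-style pick: prefer the oldest pending request whose
 * row is already open in its bank (a row hit), else fall back to the
 * oldest request overall. Bank/row values are made up for the example. */
#include <stdio.h>

struct request { int bank; int row; };

int main(void)
{
    struct request pending[] = {
        { 0, 100 },   /* oldest, but row 100 is not open in bank 0      */
        { 1, 7   },   /* row 7 is not open in bank 1 either             */
        { 0, 42  },   /* row 42 is currently open in bank 0: a row hit  */
    };
    int open_row[2] = { 42, 3 };  /* currently open row per bank */
    int n = 3, pick = 0;          /* default: oldest request      */

    for (int i = 0; i < n; i++) {          /* scan oldest -> youngest */
        if (open_row[pending[i].bank] == pending[i].row) {
            pick = i;                      /* first row hit wins */
            break;
        }
    }
    printf("service request %d first (bank %d, row %d)\n",
           pick, pending[pick].bank, pending[pick].row);
    return 0;
}
```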

Data going back up the memory hierarchy has an address associated with it. It gets cached by L3, and L2, because they have a write-allocate cache policy.

When it arrives at L2, the outstanding request buffer (from L1i) matches the address, so L2 forwards that line to L1i. Which in turn matches the address and wakes up the instruction-fetch logic that was waiting.
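A rough software model of that matching step, under the assumption that each outstanding-miss entry just records the line address and which unit is waiting for it (an MSHR-like structure; the field names and sizes here are my own, not Intel's design):

```c
/* Outstanding-request buffer model: each entry records the line address of
 * a pending miss and the waiting requester. When fill data arrives, the
 * address is matched against the entries and the waiter is woken up. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define NUM_ENTRIES 4
#define LINE_MASK   (~(uint64_t)63)   /* 64-byte line alignment */

struct miss_entry {
    int      valid;
    uint64_t line_addr;
    char     requester[8];   /* e.g. "L1i" or "L1d"; a shared cache level
                                would also need a core/sender ID here */
};

static struct miss_entry buf[NUM_ENTRIES];

static void record_miss(uint64_t addr, const char *who)
{
    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (!buf[i].valid) {
            buf[i].valid = 1;
            buf[i].line_addr = addr & LINE_MASK;
            strncpy(buf[i].requester, who, sizeof buf[i].requester - 1);
            return;
        }
    }
}

static void fill_arrives(uint64_t addr)
{
    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (buf[i].valid && buf[i].line_addr == (addr & LINE_MASK)) {
            printf("line 0x%llx filled: wake up %s\n",
                   (unsigned long long)buf[i].line_addr, buf[i].requester);
            buf[i].valid = 0;   /* entry can be reused */
            return;
        }
    }
}

int main(void)
{
    record_miss(0x40010, "L1i");   /* code-fetch miss */
    record_miss(0x81000, "L1d");   /* unrelated data miss */
    fill_arrives(0x40000);         /* fill for the code-fetch line arrives */
    return 0;
}
```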

@HadiBrais commented: Requests at the L2 need to be tagged with the sender ID. Requests at the L3 need to be tagged with yet another sender ID. The requests at the L1I need not be tagged.

Hadi also discussed the fact that L3 needs to deal with requests from multiple cores per cycle. The ring bus architecture in CPUs before Skylake-SP / SKX meant that at most 3 requests could arrive at a single L3 slice per clock (one in each direction on the ring, and one from the core attached to it). If they were all for the same cache line, it would definitely be advantageous to satisfy them all with a single fetch from this slice, so this might be something that L3 cache slices do.


See also Ulrich Drepper's What Every Programmer Should Know About Memory? for more about caches and especially about DDR DRAM. Wikipedia's SDRAM article also explains how burst transfers of whole cache lines from DRAM work.

I'm not sure whether Intel CPUs actually pass along an offset within a cache line for critical-word-first and early-restart back up the cache hierarchy. I'd guess not, because some of the closer-to-the-core data paths are much wider than 8 bytes: 64 bytes wide in Skylake.

See also Agner Fog's microarch pdf (https://agner.org/optimize/), and other links in the x86 tag wiki.