Firstly, when the IFU issues a request for 16 bytes, how does the interaction with the L1I work? Once the L1I receives an address from the IFU, does it produce the 16 bytes in succession, or does the IFU have to send the addresses of all 16 bytes, as with a traditional cache access?
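For concreteness, here is how I picture the fetch working: the IFU sends one address per fetch and the cache hands back a whole aligned window cut out of the line. This is a toy sketch; the single-address-per-window behaviour and the exact sizes are my assumptions, not documented facts:

```python
LINE_BYTES = 64    # L1I cache line size
FETCH_BYTES = 16   # assumed fetch window size

def fetch_window(addr: int, line: bytes) -> bytes:
    """Return the aligned 16-byte chunk of the 64-byte line containing addr.

    Models the assumption that the IFU sends one address per fetch and the
    L1I returns a whole aligned window, rather than 16 byte-sized requests.
    """
    assert len(line) == LINE_BYTES
    offset = (addr % LINE_BYTES) // FETCH_BYTES * FETCH_BYTES
    return line[offset:offset + FETCH_BYTES]
```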
To get to the point: assume the IFU is fetching instructions at 16B-aligned boundaries and suddenly the virtual index misses in the L1I cache. (I'm assuming the index is logical-virtual and not linear-virtual -- not entirely sure; I know that on the L1D side the AGU handles the segmentation offsets.)
What would happen, exactly? (For a concrete example, take a Skylake CPU with its ring-bus topology.)
Would the front end be stalled once the decoders finish decoding whatever came before the miss, and how would that be done? Secondly, what sort of negotiation takes place between the IFU and the L1I cache? On a miss, must the cache inform the IFU so that it stops fetching instructions? Perhaps the cache waits to receive the data from lower down and, as soon as it arrives, forwards it to the IFU -- or does the IFU sit in a spin-lock-like state and keep retrying the read?
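To make the two possibilities I'm asking about concrete, here is a toy software model (all names are invented; real hardware uses ready/valid-style signals, not function calls, but the control flow is analogous):

```python
class ToyL1IPort:
    """Toy L1I response port: 'ready' goes high when the miss data arrives.
    Purely illustrative -- the names and structure are invented."""
    def __init__(self):
        self.ready = False
        self.data = None

    def fill_arrives(self, data):
        """Called when the line fill completes lower in the hierarchy."""
        self.data = data
        self.ready = True

def ifu_poll(port, budget=1000):
    """Possibility 1: the IFU replays its fetch every 'cycle' until it hits.

    Possibility 2 would be push-style instead: the IFU parks the request,
    and the cache asserts a data-valid signal that wakes the stalled
    front end -- no replay traffic at all.
    """
    for _cycle in range(budget):
        if port.ready:
            return port.data
    return None  # still stalled after 'budget' cycles
```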
Let's assume the data it wants is on a DDR4 module and not in the cache subsystem at all -- possible if an erratic program is defeating the hardware prefetchers. I'd like to get the process clear in my mind:
- L1I cache miss, ITLB hit.
- L1I cache controller allocates a line fill buffer
- L1I cache controller requests the line from the L2, passing it the physical address (I'd imagine these requests do not clash with the hardware prefetchers' traffic, since all cache accesses must be arbitrated or queued anyway)
- L2 miss, passes address to LLC slice
- LLC slice miss
- Caching agent sends address to the home agent
- Home agent detects no cores with the data
- Home agent sends the address to the memory controller
- Memory controller converts address to (channel, dimm, rank, IC, chip, bank group, bank, row, column) tuple and does the relevant mapping, interleaving, command generation etc.
- Now the DRAM returns a cache line's worth of data. (I had assumed DDR4 would return 128 bytes, but a standard DDR4 channel is 64 bits wide with a burst length of 8, so a single burst is 64 bytes -- the same as DDR3.) The 64 bytes are sent back to the home agent; I assume requests are tracked in queue order, so the home agent knows which address the data corresponds to.
- The home agent sends the data to the caching agent; again, I assume the caching agent keeps some backlog of outstanding misses so it knows the data needs to be forwarded upward
- The data is passed to the L2; I don't know how the L2 knows it needs to go higher, but there you go
- The L2 controller passes the line to the L1I, and the L1I, again somehow, knows which line fill buffer to put the requested cache line into and that it requires an F tag (forwarding)
- The IFU either picks the data up in its spin-lock state, or some negotiation takes place with the IFU
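On the "somehow knows" parts of the list above: my understanding is that each level keeps miss-status tracking state (the line fill buffers / MSHRs), and a transaction ID travels with the request so the returning fill can be matched to it. The structure is real; the code and names below are mine:

```python
import itertools

class MissTracker:
    """Toy MSHR / line-fill-buffer file: one entry per outstanding miss,
    keyed by a transaction ID that travels downstream with the request
    and comes back attached to the fill data. All names are invented."""

    def __init__(self):
        self._ids = itertools.count()
        self.outstanding = {}  # txn id -> (phys_addr, fill_buffer_index)

    def allocate(self, phys_addr, fill_buffer_index):
        """On a miss: record where the returning data must land."""
        txn = next(self._ids)
        self.outstanding[txn] = (phys_addr, fill_buffer_index)
        return txn  # this ID accompanies the request down the hierarchy

    def fill(self, txn, data):
        """A lower level returns data tagged with txn: no guessing needed."""
        phys_addr, buf = self.outstanding.pop(txn)
        return phys_addr, buf, data

# Each level (L1I, L2, LLC/caching agent, home agent) keeps its own
# tracker, which is how a response finds its way back up the hierarchy.
l1i = MissTracker()
txn = l1i.allocate(phys_addr=0x7F00, fill_buffer_index=3)
addr, buf, line = l1i.fill(txn, data=b"\x90" * 64)
assert buf == 3  # the L1I knows exactly which line fill buffer to fill
```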
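The address-decomposition step in the memory controller (the channel/rank/bank/row/column tuple) can be sketched as simple bit slicing. The field widths and ordering below are invented for illustration; real Skylake-era controllers use hashed, firmware-configured mappings, not a fixed layout like this:

```python
def decode_dram_addr(phys_addr: int) -> dict:
    """Split a physical address into DRAM coordinates by bit slicing.

    The field widths here are made up for illustration only -- real
    mappings are hashed/interleaved and set up by the firmware.
    """
    fields = [
        ("column", 10),     # bits 0-9
        ("channel", 1),     # bit 10
        ("bank", 2),        # bits 11-12
        ("bank_group", 2),  # bits 13-14
        ("rank", 1),        # bit 15
        ("row", 17),        # bits 16-32
    ]
    out, a = {}, phys_addr
    for name, width in fields:
        out[name] = a & ((1 << width) - 1)  # take the low 'width' bits
        a >>= width                          # and move on to the next field
    return out
```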
If anyone has some more information on this process and can enlighten me further, please let me know.