1 vote

I understand the basic working of a load-store queue, which is:

  1. When loads compute their address, they check the store queue for any prior store to the same address; if there is one, they get the data from the most recent such store, otherwise from the write buffer or the data cache.
  2. When stores compute their address, they check the load queue for any load-ordering violations (a minimal sketch of both checks follows this list).
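
A minimal sketch of those two checks, assuming a simple in-order-allocated LSQ. All names here (StoreQueueEntry, load_lookup, ...) are made up for illustration; real hardware does these searches with CAM logic in a cycle or two, not a loop:

```python
class StoreQueueEntry:
    def __init__(self, seq_num, addr=None, data=None):
        self.seq_num = seq_num    # program order: smaller = older
        self.addr = addr          # None until the store address resolves
        self.data = data          # None until the store data is ready

class LoadQueueEntry:
    def __init__(self, seq_num, addr):
        self.seq_num = seq_num
        self.addr = addr
        self.done = False         # has the load obtained a value yet?

def load_lookup(load, store_queue):
    """Check 1: a load searches older stores, youngest such store first."""
    for st in sorted(store_queue, key=lambda s: s.seq_num, reverse=True):
        if st.seq_num > load.seq_num:
            continue                     # younger store: not visible to us
        if st.addr is None:
            continue                     # unresolved address: speculate past it
        if st.addr == load.addr:
            # Store-to-load forwarding (a real core would stall or replay
            # the load if st.data isn't ready yet).
            return ("forward", st.data)
    return ("cache", None)               # go to write buffer / D-cache

def store_check(store, load_queue):
    """Check 2: a resolving store looks for younger same-address loads."""
    return [ld for ld in load_queue
            if ld.seq_num > store.seq_num and ld.addr == store.addr]
```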

My doubts are about what happens in the following two cases:

  1. In the first case, suppose the load accesses the data cache (because some store addresses in the store queue are still unresolved), the access misses in the L1 data cache, and before the data can be retrieved the store's address resolves. Now the store checks the load queue for any violations. The dependent load has already accessed the data cache but hasn't received the value yet because of the long-latency miss. Does the store flag a load violation, or does it do store-to-load forwarding and cancel the data coming from the cache?

  2. When a load misses in the L1 data cache, it is placed in an MSHR so that it doesn't block the execute stage. When the miss resolves, the MSHR entry for that load holds information about the destination register and the physical address, so the value can be written to the physical register. But how does the MSHR tell the load queue that the value is available, and at which pipeline stage does this happen? I have read somewhere that MSHRs store physical addresses while the load-store queue stores virtual addresses, so how does the MSHR communicate with the LSQ?
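
A hedged sketch of one common arrangement for the second doubt, in the spirit of Kroft-style MSHRs with per-miss target lists: the MSHR records *which* LSQ entry and destination register to wake at the time the miss is allocated, so the fill is delivered by index/tag and no virtual-vs-physical address match is ever needed. All names and fields below are hypothetical:

```python
class MSHRTarget:
    def __init__(self, lq_index, dest_preg, line_offset, size):
        self.lq_index = lq_index        # load-queue entry to mark complete
        self.dest_preg = dest_preg      # physical destination register
        self.line_offset = line_offset  # byte offset of the load in the line
        self.size = size                # access size in bytes

class MSHREntry:
    def __init__(self, paddr_line):
        self.paddr_line = paddr_line    # physical line address of the miss
        self.targets = []               # loads merged onto this same miss

def on_fill(entry, line_data, load_queue, prf):
    """Cache fill returns: wake every recorded target by index, not address."""
    for t in entry.targets:
        value = line_data[t.line_offset : t.line_offset + t.size]
        prf[t.dest_preg] = value             # write the physical register
        load_queue[t.lq_index].done = True   # LSQ learns by index, no address match
```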

I haven't found any resources regarding these doubts.

Re 2: Intel CPUs, for example, replay the uops waiting for a cache-miss load result in anticipation of it being an L2 hit, then an L3 hit, and then apparently keep replaying them until they eventually succeed (if those uops are the oldest for that port). See "Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?". And see also the top part of "About the RIDL vulnerabilities and the 'replaying' of loads", but take careful note of the edit-needed caveat there. – Peter Cordes
@Peter, at least in my tests on Skylake they only seem to speculatively dispatch in anticipation of an L1 or L2 hit, not L3 or beyond. That makes sense, since L3 hits are not constant latency. So you usually get 3 total dispatches for a miss to L3 or DRAM if there is a single instruction directly dependent on the load. You could of course get more if there are more dependent instructions, and it gets especially interesting when you have a chain of dependent loads. – BeeOnRope
@BeeOnRope: Maybe I'm misremembering, but I thought we'd (you'd) seen many extra dispatches over time for a uop waiting for a cache miss from RAM. Probably that was with a pointer-chasing test, so we could consistently have exactly one cache-miss load in flight at once that had its address ready. IIRC, L2-hit pointer chasing had 1 extra dispatch, L3-hit had a couple more, and it seemed L3-miss had enough extra to be explained by dispatch starting every 5 cycles after a certain point. Or something along those lines. – Peter Cordes
@BeeOnRope: Is there a good Q&A with an updated description of uop replay? It seems I never got around to updating some of my answers after we discovered that it's not split loads or cache misses themselves that replay from the RS, but the uop(s) dependent on them, so pointer chasing misled us. But I had hoped there was an accurate description somewhere outside of comments. Maybe on your wiki? – Peter Cordes
@PeterCordes - yes, exactly: you can see many replays per miss (up to ~10, IIRC), but those are in cases of "nested" replays like pointer chasing, or in cases where many uops are dependent on the load. I don't recall any repeated dispatch over time for pure load misses as you describe. There are repeated dispatches over time in other cases, though, so maybe that's what you're thinking of: in the case of store-to-load forwarding, you could see a lot of replays of the store over time if it depends on a missing load, or something like that. – BeeOnRope

1 Answer

2 votes
  1. This is speculative execution, where loads bypass older stores whose addresses are unresolved. When the older store resolves, a load violation can be flagged. If the probability of address aliasing is low, this speculation is profitable (more throughput), which should typically be true for real programs. On detecting a match in the load queue, the hardware takes the appropriate step: (a) store-to-load forward if the load hasn't yet consumed a value, or (b) roll the pipeline back and replay from the mis-speculated load (see the first sketch below).

  2. The same way as when loads are served by cache hits (which take 1-3 cycles for an L1 hit): the returning value is broadcast together with its destination tag. For example, in a reservation-station design with a CDB (common data bus), the result on the bus is captured by every HW structure that needs it: the load queue entry, the physical register file, and dependent uops. Because the MSHR records the load's queue index and destination register when the miss is allocated, this wakeup works by tag/index rather than by address, so the virtual-vs-physical mismatch never has to be reconciled (see the second sketch below).
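
To make (a)/(b) in point 1 concrete, here is a hedged sketch of what a store could do when its address resolves, reusing the entry classes from the question's sketch. Whether hardware forwards to a still-in-flight load or simply squashes on any match is a design choice; many real cores squash for simplicity:

```python
def squash_and_replay_from(load):
    """Recovery stub: flush this load and everything younger, then refetch."""
    print(f"squash from seq {load.seq_num} and replay")

def on_store_resolve(store, load_queue):
    """Run check 2 when a store's address (and data) become known."""
    for ld in load_queue:
        if ld.seq_num < store.seq_num or ld.addr != store.addr:
            continue                   # older load or different address: OK
        if ld.done:
            # The load already completed with stale cache data: a true
            # ordering violation, so squash and re-execute from the load.
            squash_and_replay_from(ld)
        else:
            # Question 1's case: the miss is still outstanding, so no wrong
            # value has been consumed yet. One option is to forward now and
            # drop the eventual cache fill for this entry; the conservative
            # alternative is to squash here as well.
            ld.value = store.data
            ld.done = True
            ld.ignore_fill = True      # tell the fill path to skip this entry
```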
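
And a sketch of the tag-based broadcast mentioned in point 2: consumers match on the destination tag carried with the result, which is why the MSHR's physical addresses and the LSQ's virtual addresses never need to be compared. Again, all names are illustrative, not from any real design:

```python
class CDB:
    """Common data bus: one result per cycle, snooped by all consumers."""
    def __init__(self):
        self.listeners = []            # LSQ, register file, RS entries, ...

    def broadcast(self, tag, value):
        for listener in self.listeners:
            listener.on_result(tag, value)

class ReservationStation:
    """Waits for source operands identified by tags, not by addresses."""
    def __init__(self, src_tags):
        self.srcs = dict.fromkeys(src_tags)   # tag -> value (None = waiting)

    def on_result(self, tag, value):
        if tag in self.srcs and self.srcs[tag] is None:
            self.srcs[tag] = value            # operand captured off the bus

    def ready(self):
        return all(v is not None for v in self.srcs.values())
```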