
We know that newly stored (dirty) data is not immediately written back to RAM: it is stashed away in the store buffer and then written back later as time permits. There is also the store-forwarding technique: if you do a subsequent LOAD to the same location on the same core before the value is flushed to the cache/memory, the value from the store buffer will be "forwarded" and you will get the value that was just stored. This can be done in parallel with the cache access, so it doesn't slow things down.
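For concreteness, here is a minimal C sketch of the pattern I mean (the function name is just for illustration):

```c
#include <stdint.h>

/* Illustration only: the store to *p sits in the store buffer until it
 * commits to L1d cache; the immediately following load from the same
 * address can be satisfied by store-to-load forwarding from the store
 * buffer instead of waiting for that commit. */
uint64_t store_then_load(uint64_t *p, uint64_t v) {
    *p = v;        /* store: buffered, not yet committed to cache */
    return *p;     /* load: value forwarded from the store buffer */
}
```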

My question is: with the help of the store buffer and store forwarding, store misses don't necessarily require the processor (the corresponding core) to stall. Does that mean store misses do not contribute to the total cache-miss latency, right?

Thanks.


1 Answer


DRAM latency is really high, so it's easy for the store buffer to fill up and stall allocation of new store instructions into the back-end when a cache-miss store blocks its progress. The ability of the store buffer to decouple / insulate execution from cache misses is limited by its finite size, but it always helps some. You're right that it's easier to hide cache-miss latency for stores than for loads.
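To make that concrete, here's a hedged C sketch (the function name and the 4096-byte stride are mine, purely illustrative) of a store pattern that would keep the store buffer full of cache-miss stores:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustration only: with a 4096-byte stride, each store touches a new
 * cache line (and page), so most stores miss in cache.  Once all the
 * store-buffer entries (a few tens of entries on typical recent cores;
 * the exact count varies by microarchitecture) are occupied by stores
 * waiting for their RFOs, allocation of further stores stalls. */
void scatter_stores(uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; i += 4096)
        buf[i] = 1;   /* each iteration: a likely cache-miss store */
}
```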

Stalling and filling up the store buffer is more of a problem with a strongly ordered memory model like x86's TSO: stores can only commit from the store buffer into L1d cache in program order, so any cache-miss store blocks store-buffer progress until the RFO (Read For Ownership) completes. Initiating the RFO early (before the store reaches the commit end of the store buffer, e.g. upon retire) can hide some of this latency by getting the RFO in flight before the data needs to arrive.
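A sketch of that in-order-commit constraint, with variable names of my own choosing (whether the first store actually misses of course depends on cache state):

```c
#include <stdint.h>

/* Under x86 TSO, stores commit from the store buffer to L1d in program
 * order.  Even if the line holding *hot is already present in L1d, the
 * store to it cannot commit until the older store to *cold (assumed
 * here to miss) has completed its RFO and committed first. */
void ordered_stores(uint64_t *cold, uint64_t *hot) {
    *cold = 1;   /* assume cache miss: blocks the commit end        */
    *hot  = 2;   /* hits in L1d, but must wait its turn to commit   */
}
```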

Consecutive stores into the same cache line can be coalesced into a buffer that lets them all commit at once when the data arrives from RAM (or from another core which had ownership). There's some evidence that Intel CPUs actually do this, in the limited cases where that wouldn't violate the memory-ordering rules.
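For example, here's a sketch of the kind of same-line store pattern such coalescing could help, assuming a 64-byte cache line and an aligned pointer (both assumptions are mine):

```c
#include <stdint.h>

/* Four stores covering one 64-byte cache line.  A CPU that coalesces
 * adjacent same-line stores (as suggested above for some Intel CPUs,
 * where memory ordering permits) could commit all four together once
 * the line arrives, rather than one commit per store. */
void fill_line(uint64_t *line) {   /* assume 64-byte aligned */
    line[0] = 0;
    line[1] = 1;
    line[2] = 2;
    line[3] = 3;
}
```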