
I think that, for the CPU to continue executing subsequent instructions, the store buffer must do part of the MESI processing to maintain cache consistency, because the latest value is held in the store buffer and not in the cache. So the store buffer would send read-invalidate or invalidate request messages and flush the latest value to the cache after the ACK arrives.
The cache itself cannot do it.

Is my analysis correct?
Or is all MESI processing done by the cache?

The store buffer does not participate in cache coherence at all because it doesn't have to. Requests in the store buffer get sent to the L1 controller (or whatever hardware structure is in the coherence domain), and the L1 controller is the one that participates in coherence by requesting ownership of the target cache line. Any subsequent instructions executing on the same logical core will nonetheless use the result of the store even before the ownership request gets satisfied. This doesn't violate coherence because other cores cannot see the results of these instructions until the store retires. – Hadi Brais
I'm assuming that by "cache consistency" you're referring to cache coherence (formally, there is a distinction between them). – Hadi Brais
@HadiBrais: A CPU can optimize by sending RFOs early, so lines become hot in L1d sooner and cache-miss stores aren't delayed as long as they would be if you waited until a store is ready to commit from the store buffer to L1d. For example, one of Skylake's features is that L1 store misses generate L2 requests much earlier than on previous microarchitectures. (Intel's optimization manual says that, too.) I'm not sure if that's what the OP is asking. – Peter Cordes
@LosGeles, why is this required to "continue executing subsequent instructions"? Any younger load would get the store data forwarded regardless of the store's status (even before it commits). – Leeor
The store can retire before the RFO has been granted, but it can't commit (leave the store buffer and become visible at the coherence point) until that happens. Stores can stay in the store buffer after retirement. – BeeOnRope

1 Answer


On most designs the store buffer wouldn't directly send invalidate requests, and it is usually not even snooped[1] by external requests. That is, it is part of the private, core side of the coherence domain and so doesn't need to participate in coherence. Instead, the store buffer ultimately interacts with the first level of the caching subsystem, which is itself responsible for the various parts of the MESI protocol.
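To make the split concrete, here is a minimal toy sketch in Python. It is not a model of any real microarchitecture: `ToyCore`, `coherent_memory`, and the `("RFO", addr)` log entries are all hypothetical names invented for illustration. The shared dictionary stands in for the coherence point, and the only place an ownership request is issued is at commit time, on the cache side.

```python
from collections import deque


class ToyCore:
    """Hypothetical core with a private store buffer (illustration only)."""

    def __init__(self, coherent_memory, log):
        self.mem = coherent_memory   # stands in for the coherent cache hierarchy
        self.store_buffer = deque()  # private to this core, never snooped here
        self.log = log               # records coherence traffic issued at commit

    def store(self, addr, value):
        # Executing a store only adds an entry to the private buffer.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Store-to-load forwarding: the youngest matching buffered store wins.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.mem.get(addr, 0)

    def commit_oldest(self):
        # The cache side obtains ownership, then the store becomes visible.
        addr, value = self.store_buffer.popleft()
        self.log.append(("RFO", addr))
        self.mem[addr] = value


mem, log = {}, []
c0, c1 = ToyCore(mem, log), ToyCore(mem, log)

c0.store(0x40, 7)
own_view = c0.load(0x40)      # forwarded from c0's own store buffer
other_view = c1.load(0x40)    # c1 still sees the old value
c0.commit_oldest()
after_commit = c1.load(0x40)  # visible only after the commit
```

In this sketch the storing core keeps executing and sees its own value right away, while the other core only sees it once the entry has left the store buffer.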

How that interaction works exactly depends on the design, of course. A simple design may process only one store at a time: it performs the RFO for the oldest store, the one at the head of the store buffer, and moves on to the next entry when that completes. A more sophisticated design might send RFOs for several upcoming requests in the store buffer in an attempt to exploit more memory-level parallelism (MLP). The exact mechanism isn't clear to me on x86: stores to L2 seem to perform quite poorly in some scenarios, but I'm pretty sure a bunch of store misses to RAM will perform much better than if they were handled serially.
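The difference between the two approaches can be sketched with a back-of-the-envelope timing model. Everything here is an assumption for illustration: `drain_cycles` is a hypothetical helper, the 100-cycle RFO latency is a made-up number rather than a measurement, every store is assumed to miss to a distinct line, and stores commit in order at no extra cost once their line is owned.

```python
def drain_cycles(num_stores, rfo_latency, max_outstanding):
    """Cycles until the last of `num_stores` store misses has committed.

    Toy model: at most `max_outstanding` RFOs are in flight at once, and
    an RFO slot frees up when the store that used it commits.
    """
    commit = []  # commit[i] is the cycle at which store i commits
    for i in range(num_stores):
        # The RFO for store i can be issued once a slot is free.
        issue = 0 if i < max_outstanding else commit[i - max_outstanding]
        ready = issue + rfo_latency
        # In-order commit: a store cannot commit before the one ahead of it.
        prev = commit[-1] if commit else 0
        commit.append(max(ready, prev))
    return commit[-1] if commit else 0


serial = drain_cycles(8, 100, 1)      # one RFO at a time: 800 cycles
overlapped = drain_cycles(8, 100, 4)  # four RFOs in flight: 200 cycles
```

Under these assumptions the serial design pays the full miss latency for every store, while the overlapped design pays it roughly once per batch of outstanding RFOs.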


[1] There are some exceptions: for example, simultaneous multithreading (hyperthreading on x86), in which two logical cores share all levels of cache and hence are able to avail themselves of the normal cache coherency mechanisms, may require store buffer snoops.