How is the transitivity/cumulativity property of memory barriers implemented micro-architecturally?

Question

I've been reading about how the x86 memory model works and the significance of the barrier instructions on x86, and comparing it to other architectures such as ARMv8. In both the x86 and ARMv8 architectures, it appears (no pun intended) that the memory models respect transitivity/cumulativity, i.e. if CPU1 sees stores by CPU0, and CPU2 sees stores by CPU1 that could only have occurred if CPU1 saw CPU0's stores, then CPU2 must also see CPU0's stores. The examples I'm referring to are examples 1 and 2 in section 6.1 of Paul McKenney's famous paper (relevant albeit old; the same thing exists in his latest perf cook book, http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf). If I understand correctly, x86 uses store queues (or store order buffers) to order the stores (and for other micro-architectural optimizations) before they become globally visible (i.e. written to L1D). My question is: how do x86 (and other architectures) implement the transitivity property micro-architecturally? The store queue ensures that a particular CPU's stores are made globally visible in a particular order, but what ensures the ordering of stores made by one CPU relative to stores made by different CPUs?

Peter Cordes · Accepted Answer · 2019-09-19T20:55:05

On x86, there is only one coherency domain. Stores become visible to all other cores at exactly the same time, when they commit to L1d cache. That along with MESI in general is enough to give us a total store order that all threads can agree on.

A few ISAs (including PowerPC) don't have that property (in practice because of store-forwarding of retired stores within a physical core, across SMT threads). So mo_relaxed stores from 2 threads can be seen in different orders by 2 other readers in practice on POWER hardware; see "Will two atomic writes to different locations in different threads always be seen in the same order by other threads?". (Presumably barriers on PowerPC block that forwarding.)

The ARM memory model used to allow this IRIW (Independent Reads of Independent Writes) reordering, but in practice no ARM HW ever existed that did it. ARM was able to strengthen their memory model to guarantee that all cores agree on a global order for stores done by multiple other cores.
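The IRIW pattern can be sketched as a C++ litmus test. This is only an illustration, not something from McKenney's paper: the names `IriwResult`, `iriw_once` and `readers_disagree` are made up for this sketch. Two writers store to independent locations and two readers load them in opposite orders; the "disagreeing" outcome is the one where each reader sees one store but not the other.

```cpp
#include <atomic>
#include <thread>

// What each reader observed in one IRIW trial (hypothetical type for this sketch).
struct IriwResult {
    int r1, r2;  // reader A: loads x first, then y
    int r3, r4;  // reader B: loads y first, then x
};

// Runs one trial with the given memory orders for the stores and the loads.
// store_mo must be relaxed, release or seq_cst; load_mo must be relaxed,
// acquire or seq_cst.
inline IriwResult iriw_once(std::memory_order store_mo, std::memory_order load_mo) {
    std::atomic<int> x{0}, y{0};
    IriwResult res{};

    std::thread writer0([&] { x.store(1, store_mo); });
    std::thread writer1([&] { y.store(1, store_mo); });
    // Each reader writes only its own fields of res, so there is no data race.
    std::thread readerA([&] { res.r1 = x.load(load_mo); res.r2 = y.load(load_mo); });
    std::thread readerB([&] { res.r3 = y.load(load_mo); res.r4 = x.load(load_mo); });

    writer0.join(); writer1.join(); readerA.join(); readerB.join();
    return res;
}

// True if reader A saw x before y while reader B saw y before x,
// i.e. the readers disagree about the order of the two independent stores.
inline bool readers_disagree(const IriwResult& r) {
    return r.r1 == 1 && r.r2 == 0 && r.r3 == 1 && r.r4 == 0;
}
```

With `std::memory_order_seq_cst` for every operation, the C++ memory model forbids `readers_disagree` from ever being true. With relaxed stores and acquire loads, the language allows it, and that is the outcome that has been observed on POWER; on hardware where all cores agree on one store order, it should not show up even though the language model permits it.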

(Store forwarding still means that the core doing the store sees it right away, long before it becomes globally visible. And of course load ordering is required for cores to be able to say anything about the order in which they observed independent writes.)

If all cores must agree on the global ordering of stores, then (in your example) seeing the store from Core2 implies that Core1's store must have already happened, and that you can see it, too.

(Assuming that Core2 used appropriate barriers or acquire-load or release-store to make sure its store happened after its load that saw Core1's store.)
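At the language level, the causality chain from the question can be sketched like this, using the question's CPU0/CPU1/CPU2 roles. It is a minimal sketch, and the function name `causality_chain_once` is made up here. The middle thread uses an acquire-load followed by a release-store, which is the "appropriate barriers" assumption above.

```cpp
#include <atomic>
#include <thread>

// One run of the three-thread chain. Returns the value of x that the last
// thread observed after it had seen y == 1.
inline int causality_chain_once() {
    std::atomic<int> x{0}, y{0};
    int seen_x = -1;

    // CPU0: the original store.
    std::thread cpu0([&] { x.store(1, std::memory_order_release); });

    // CPU1: stores y only after it has seen CPU0's store to x.
    std::thread cpu1([&] {
        while (x.load(std::memory_order_acquire) != 1) { std::this_thread::yield(); }
        y.store(1, std::memory_order_release);
    });

    // CPU2: waits for CPU1's store, then checks whether CPU0's store is visible.
    std::thread cpu2([&] {
        while (y.load(std::memory_order_acquire) != 1) { std::this_thread::yield(); }
        seen_x = x.load(std::memory_order_relaxed);
    });

    cpu0.join(); cpu1.join(); cpu2.join();
    return seen_x;  // read after join, so no data race on seen_x
}
```

In C++, the release/acquire pairs make happens-before transitive across the three threads, so the function has to return 1. How the hardware delivers that (a single coherency domain on x86, or cumulative barriers on weaker ISAs) is what the rest of this answer is about.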

Possibly also related:

Concurrent stores seen in a consistent order

