I'm wonderring why are memory barriers needed and I have read some articles about this toppic.
Someone says it's because of cpu out-of-order execution while others say it is because of cache consistency problems which store buffer and invalidate queue cause.
So, what's the real reason that memory barriers are needed? cpu out-of-order execution or cache consistency problems? or both? Does cpu out-of-order execution have something to do with cache consistency? and what's the difference between x86 and arm?
1 Answers
You need barriers to order this core / thread's accesses to globally-visible coherent cache when the ISA's memory ordering rules are weaker than the semantics you need for your algorithm.
Cache is always coherent, but that's a separate thing from consistency (ordering between multiple operations).
You can have memory reordering on an in-order CPU. In more detail, How is load->store reordering possible with in-order commit? shows how you can get memory reordering on a pipeline that starts executing instructions in program order, but with a cache that allows hit-under-miss and/or a store buffer allowing OoO commit.
Related:
- Does an x86 CPU reorder instructions? talks about the difference between memory reordering vs. out of order exec. (And how x86's strongly ordered memory model is implemented on top of aggressive out-of-order execution by having hardware track ordering, with the store buffer decoupling store execution from store visibility to other threads/cores.)
- x86 memory ordering: Loads Reordered with Earlier Stores vs. Intra-Processor Forwarding
- Globally Invisible load instructions
See also https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ and https://preshing.com/20120930/weak-vs-strong-memory-models for some more basics. x86 has a "strong" memory ordering model: program order plus a store buffer with store-forwarding. C++ acquire
and release
are "free", only atomic RMWs and seq_cst stores need barriers.
ARM has a "weak" memory ordering model: only C++ memory_order_consume
(data dependency ordering) is "free", acquire and release require special instructions (like ldar
/ stlr
) or barriers.