Why can't a load bypass a value written by another thread on the same core from a write buffer?

Question

If a CPU core uses a write buffer, a load can bypass the most recent store to the referenced location from the write buffer, without waiting for the value to appear in the cache. But, according to A Primer on Memory Consistency and Coherence, if the CPU honors the TSO memory model, then

... multithreading introduces a subtle write buffer issue for TSO. TSO write buffers are logically private to each thread context (virtual core). Thus, on a multithreaded core, one thread context should never bypass from the write buffer of another thread context. This logical separation can be implemented with per-thread-context write buffers or, more commonly, by using a shared write buffer with entries tagged by thread-context identifiers that permit bypassing only when tags match.

I can't grasp why this restriction is necessary. Could you please give an example where allowing a thread to bypass a write buffer entry written by another thread on the same core violates the TSO memory model?

I'm voting to close this question as off-topic because it is about computer processor design, not programming. — Raymond Chen
OP has tagged the question appropriately, I'd say it's a valid question, so don't close it. — Isuru H
I think it's there because data in the store buffer has not been through the coherence protocol, hence allowing another thread to see writes early, before they are made globally visible, could be a violation. Think of a 4-thread situation where two threads modify the same location and the other two try to read it. [My knowledge is a bit rusty on write buffers, I'm pulling my hair out trying to understand what happens to the writes in the above scenario, presumably one has to be redone.] — Isuru H
@IsuruH Do you mean that all 4 threads are running on the same core? I can't see the problem. Assuming all threads are sharing the same write buffer, 2 storing threads just push their values into the buffer, while 2 loading threads are taking the latest value from the buffer. — undermind

Leeor Leeor · Accepted Answer · 2017-03-17T09:26:08

The classic example of how TSO differs from sequential consistency (SC) is:

(This is example 2.4 here - http://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf)

  thread 0     |     thread 1
---------------------------------
write 1-->[x]  |   write 1-->[y]    
a = read [x]   |   b = read  [y]    
c = read [y]   |   d = read  [x]

Both addresses hold 0 initially. The question is: would c=d=0 be a valid outcome? We know that a and b must see the values of the stores before them, since they match the addresses of the local stores, and those values will probably be forwarded from the local thread's store buffer. However, stores are not forwarded across contexts, so c and d may still see the old value.

The interesting gotcha here is that each thread observes both stores and forwards the local one, so an outcome of a=1, c=0 would mean that t0 saw the store to [x] occur first, while an outcome of b=1, d=0 would mean that t1 saw the store to [y] occur first. Because store buffer forwarding makes this outcome possible, it breaks sequential consistency, which requires that all contexts agree on the same global order of stores. Instead, x86 settled for the weaker TSO model, which allows this case.
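To make the outcome above concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the `Context` class, its method names, and the chosen interleaving are hypothetical illustrations, not a description of any real microarchitecture. Each context has a private FIFO store buffer and forwards loads only from its own entries.

```python
from collections import deque

class Context:
    """Toy hardware thread context with a private FIFO store buffer (hypothetical model)."""
    def __init__(self, memory):
        self.memory = memory
        self.buffer = deque()  # pending (address, value) stores, oldest first

    def store(self, addr, value):
        # Stores are buffered and are not yet visible to other contexts.
        self.buffer.append((addr, value))

    def load(self, addr):
        # Forward from the youngest matching entry in this context's own buffer only.
        for a, v in reversed(self.buffer):
            if a == addr:
                return v
        return self.memory[addr]

    def drain(self):
        # Commit buffered stores to shared memory in program order.
        while self.buffer:
            a, v = self.buffer.popleft()
            self.memory[a] = v

memory = {"x": 0, "y": 0}
t0, t1 = Context(memory), Context(memory)

# One possible interleaving: both stores are still buffered when the loads run.
t0.store("x", 1)
t1.store("y", 1)
a = t0.load("x")   # forwarded from t0's own buffer
b = t1.load("y")   # forwarded from t1's own buffer
c = t0.load("y")   # t1's store is not visible yet, so this reads memory
d = t1.load("x")   # t0's store is not visible yet, so this reads memory
t0.drain()
t1.drain()

print(a, b, c, d)  # 1 1 0 0
```

In this interleaving t0 sees its store to [x] before the store to [y], and t1 sees the opposite, which is the disagreement on store order that TSO permits and SC forbids.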

Forwarding stores globally is practically impossible since buffered stores are not necessarily committed, which means they may even be on the wrong path of a branch misprediction. Forwarding locally is fine since a flush would also eliminate all the loads that forwarded from them, but across multiple contexts you don't have that guarantee. I've also seen work that tries to buffer stores globally outside of the core, but this is not very practical due to latency and bandwidth. For further reading, here's a recent paper that may be relevant - http://ieeexplore.ieee.org/abstract/document/7783736/
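The misprediction hazard can be sketched with a shared buffer whose entries are tagged by context, as in the quoted passage from the Primer. This is a simplified, hypothetical model: `SharedStoreBuffer`, `squash`, and `run` are names invented for illustration, and it assumes every buffered store of the squashed context is still speculative, which a real core would track per entry.

```python
class SharedStoreBuffer:
    """Toy shared store buffer with entries tagged by context id (hypothetical model)."""
    def __init__(self, memory, cross_context_forwarding=False):
        self.memory = memory
        self.entries = []  # (context_id, address, value), oldest first
        self.cross = cross_context_forwarding

    def store(self, ctx, addr, value):
        self.entries.append((ctx, addr, value))

    def load(self, ctx, addr):
        # Search youngest to oldest; forward only if the tag matches,
        # unless cross-context forwarding is (unsafely) enabled.
        for owner, a, v in reversed(self.entries):
            if a == addr and (self.cross or owner == ctx):
                return v
        return self.memory[addr]

    def squash(self, ctx):
        # Branch misprediction: discard this context's buffered stores.
        # Assumption: all of them are speculative in this toy model.
        self.entries = [e for e in self.entries if e[0] != ctx]

def run(cross):
    memory = {"x": 0}
    sb = SharedStoreBuffer(memory, cross_context_forwarding=cross)
    sb.store(0, "x", 1)     # context 0 stores on what turns out to be the wrong path
    seen = sb.load(1, "x")  # context 1 loads the same address
    sb.squash(0)            # misprediction detected, the store never happens
    return seen, memory["x"]

print(run(cross=False))  # (0, 0): context 1 reads memory
print(run(cross=True))   # (1, 0): context 1 saw a value that never reached memory
```

With tag matching, the squash only has to remove context 0's own loads and stores. With cross-context forwarding, context 1 would also have to be rolled back, which the flush of context 0 does not do.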
