Intel-x86:The interaction between WC, WB and UC Memory

Question

The memory ordering guarantees across different memory regions on x86 architectures are not clear to me. Specifically, the Intel manual states that WC, WB and UC follow different memory orderings as follows.

WC: weak ordering (where e.g. two stores on different locations can be reordered)

WB (as well as WT and WP, i.e. all cacheable memory types): processor ordering (a.k.a. TSO, where younger loads can be reordered before older stores on different locations)

UC: strong ordering (where all instructions are executed in the program order and cannot be reordered)

What is not clear to me is the interaction between UC and the other regions. Specifically, the manual mentions:

(A) UC accesses are strongly ordered in that they are always executed in program order and cannot be reordered; and

(B) WC accesses are weakly-ordered and can thus be reordered.

So between (A) and (B) it is not clear how UC accesses and WC/WB accesses are ordered w.r.t. one another.

1a) [UC-store/WC-store ordering] For instance, let us assume that x is in UC memory and y is WC memory. Then in the multi-threaded program below, is it possible to load 1 from y and 0 from x? This would be possible if the two stores in thread 0 can be reordered. (I have put an mfence between the two loads hoping that it would stop the loads from being reordered, as it is not clear to me whether WC/UC loads can be reordered; see 3a below)

       thread 0       |   thread 1
     store [x] <-- 1  |   load [y]; mfence 
     store [y] <-- 1  |   load [x]

1b) What if instead (symmetrically) x were in WC memory and y were in UC memory?

2a) [UC-store/WB-load ordering] Similarly, can a UC-store and a WB-load (on different locations) be reordered? Let us assume that x is in UC memory and z is in WB memory. Then in the multi-threaded program below, is it possible for both loads to load 0? This would be possible if both x and z were in WB memory, due to store buffering (or alternatively justified as: younger loads in each thread can be reordered before the older stores as they are on different locations). But since the accesses on x are in UC memory, it is not clear whether such behaviours are possible.

       thread 0       |   thread 1
     store [x] <-- 1  |   store [z] <-- 1 
     load [z]         |   load [x]

2b) [UC-store/WC-load ordering] What if z were in WC memory (and x is in UC memory)? Can both loads load 0 then?

3a) [UC-load/WC-load ordering] Can a UC-load and a WC-load be reordered? Once again, let us assume that x is in UC memory and y is in WC memory. Then, in the multi-threaded program below, is it possible to load 1 from y and 0 from x? This would be possible if the two loads could be reordered (I believe the two stores cannot be reordered due to the intervening sfence; the sfence may not be needed depending on the answer to 1a).

       thread 0               |   thread 1
     store [x] <-- 1; sfence  |   load [y] 
     store [y] <-- 1          |   load [x]

3b) What if instead (symmetrically) x were in WC memory and y were in UC memory?

4a) [WB-load/WC-load ordering] What if in the example of 3a above x were in WB memory (instead of UC) and y were in WC memory (as before)?

4b) What if (symmetrically) x were in WC memory and y were in WB memory?

Re: load ordering: I think only SSE4.1 movntdqa loads can be reordered, and only when loading from WC memory. (Otherwise they're just a slower movdqa: Do current x86 architectures support non-temporal loads (from "normal" memory)?). Yes, MFENCE (or maybe LFENCE) will order it wrt. other loads, or I think it's safe to just use normal loads, and they'll still have their usual acquire semantics even from WC mem. — Peter Cordes

Brendan · Accepted Answer · 2021-04-07T04:59:30

WARNING: I am ignoring cache coherency in all of this, because it complicates everything and doesn't make any difference to understanding how WB, WT, WP, WC or UC work, or to any of the answers.

Assume you have 4 pieces, like:

          ________
         |        |
         | Caches |
         |________|
         /       \
  ______/_       _\__________________
 |        |     |                    |
 |  CPU   |-----|  Physical address  |
 |  core  |     |  space (e.g. RAM)  |
 |________|     |____________________|
        \        /
       __\______/_
      |           |
      | Write     |
      | combining |
      | buffer    |
      |___________|

As far as the CPU's core is concerned; everything is always "processor ordering" (total store ordering with store forwarding). The only difference between WC, WB, WT, WP and UC is the path data takes to go between the CPU core and the physical address space.

For UC, writes go directly to the physical address space and reads come directly from the physical address space.

For WC, writes go down to the "write combining buffer", where they're combined with previous writes and eventually evicted from the buffer (and sent to the physical address space later). Reads from WC come directly from the physical address space.

For WB, writes go to caches and are evicted from the caches (and sent to the physical address space) later. For WT, writes go to both caches and the physical address space at the same time. For WP, writes are sent to the physical address space and invalidate the corresponding cache lines rather than updating them. For all of these, reads come from cache (and cause a fetch from the physical address space into cache on a "cache miss").
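The UC, WC and WB paths described above can be sketched as a toy model. This is only an illustration of the picture in this answer, not of real hardware: the class `ToyCore` and its methods `drain_wc` and `evict` are made-up names, WT and WP are left out to keep it short, and store forwarding and cache coherency are ignored.

```python
class ToyCore:
    """One core with its own cache and write-combining buffer (hypothetical model)."""

    def __init__(self, memory, types):
        self.memory = memory    # physical address space: addr -> value
        self.types = types      # addr -> "UC", "WC" or "WB"
        self.cache = {}         # cached copies of WB locations
        self.dirty = set()      # WB locations written but not yet evicted
        self.wc_buffer = {}     # WC stores waiting to be evicted

    def store(self, addr, value):
        kind = self.types[addr]
        if kind == "UC":
            self.memory[addr] = value       # straight to the address space
        elif kind == "WC":
            self.wc_buffer[addr] = value    # combined now, evicted later
        elif kind == "WB":
            self.cache[addr] = value        # cache only, evicted later
            self.dirty.add(addr)

    def load(self, addr):
        if self.types[addr] in ("UC", "WC"):
            return self.memory[addr]        # read directly from the address space
        if addr not in self.cache:          # WB cache miss: fetch into cache
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def drain_wc(self):
        """Evict the write-combining buffer to the address space."""
        self.memory.update(self.wc_buffer)
        self.wc_buffer.clear()

    def evict(self):
        """Write dirty WB data back to the address space."""
        for addr in self.dirty:
            self.memory[addr] = self.cache[addr]
        self.dirty.clear()


memory = {"x": 0, "y": 0, "z": 0}
core = ToyCore(memory, {"x": "UC", "y": "WC", "z": "WB"})
for addr in ("x", "y", "z"):
    core.store(addr, 1)
print(memory)   # {'x': 1, 'y': 0, 'z': 0}: only the UC store is visible so far
core.drain_wc()
core.evict()
print(memory)   # {'x': 1, 'y': 1, 'z': 1}
```

In this sketch an external observer of `memory` sees the UC store immediately, while the WC and WB stores only show up once the buffer is drained or the cache line is evicted.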

There are 3 other things that influence this:

store forwarding. Any store can be forwarded to a later load within "CPU core", regardless whether the area is supposed to be WC, WB, WT, ... or UC. This means that it's technically wrong to claim that 80x86 has "total store ordering".
non-temporal stores cause data to go to the write combining buffers (regardless of whether the memory area was originally WB or WT or ... or UC). Non-temporal reads allow a later non-temporal read to occur before an earlier store.
write fences prevent store forwarding and wait for the write combining buffer to be emptied. Read fences cause the CPU to wait until earlier reads complete before allowing later reads. The mfence instruction combines the behavior of a read fence and a write fence. Note: I lost track of lfence; for some recent CPUs I think it got perverted into a hack to help mitigate "Spectre" security problems (I think it became a speculative execution barrier rather than just a read fence).

Now...

1a)

  thread 0             |     thread 1
store [x_in_UC] <-- 1  |   load [y_in_WC]; mfence 
store [y_in_WC] <-- 1  |   load [x_in_UC]

In this case the mfence is irrelevant (the previous load [y_in_WC] acts like UC anyway); but the store to y_in_WC may take ages to make its way to the physical address space (which isn't important because it's possibly last anyway). It's not possible to load 1 from y and 0 from x.

1b)

   thread 0             |     thread 1
 store [x_in_WC] <-- 1  |   load [y_in_UC]; mfence 
 store [y_in_UC] <-- 1  |   load [x_in_WC]

In this case, the store [x_in_WC] may take ages to make its way to the physical address space, which means that load [x_in_WC] may fetch older data from the physical address space (even if the load is done after the store). It's very possible to load 1 from y and 0 from x.

2a)

   thread 0             |     thread 1
 store [x_in_UC] <-- 1  |   store [z_in_WB] <-- 1
 load [z_in_WB]         |   load [x_in_UC]

In this case there's nothing confusing at all (everything happens in the program order; it's just that store [z_in_WB] writes to cache and load [z_in_WB] reads from cache); and it's not possible for both loads to load 0. Note: an external observer (e.g. a device watching the physical address space) may not see the store to z_in_WB for ages.

2b)

   thread 0             |     thread 1
 store [x_in_UC] <-- 1  |   store [z_in_WC] <-- 1
 load [z_in_WC]         |   load [x_in_UC]

In this case the store [z_in_WC] may not reach the physical address space until after the load [z_in_WC] has occurred (even if the load is done after the store). It is possible for both loads to load 0.
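The reasoning for "2b" can be checked by brute force in a toy model. The sketch below is an assumption-laden illustration, not a hardware simulator: it treats a WC store as landing in a buffer, adds a made-up `drain` event that may happen at any point, runs every interleaving that respects program order, and ignores store forwarding. For comparison it also runs the same program with z in UC memory.

```python
T0 = [("store", "x"), ("load", "z")]
T1 = [("store", "z"), ("load", "x")]
DRAIN = [("drain", None)]   # hypothetical event: the WC buffer gets evicted


def interleavings(*threads):
    """Yield every merge of the threads that keeps each thread's program order."""
    if not any(threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + (t[1:],) + threads[i + 1:]
            for tail in interleavings(*rest):
                yield [t[0]] + tail


def run(order, types):
    """Execute one interleaving; return (value loaded from z, value loaded from x)."""
    memory = {"x": 0, "z": 0}
    wc_buffer = {}
    loaded = {}
    for op, addr in order:
        if op == "store":
            if types[addr] == "WC":
                wc_buffer[addr] = 1      # not visible in memory yet
            else:
                memory[addr] = 1         # UC: straight to memory
        elif op == "load":
            loaded[addr] = memory[addr]  # UC and WC loads read memory directly
        elif op == "drain":
            memory.update(wc_buffer)
            wc_buffer.clear()
    return loaded["z"], loaded["x"]


def outcomes(types):
    return {run(order, types) for order in interleavings(T0, T1, DRAIN)}


z_in_wc = outcomes({"x": "UC", "z": "WC"})
z_in_uc = outcomes({"x": "UC", "z": "UC"})
print((0, 0) in z_in_wc)   # True: both loads can see 0 while z sits in the buffer
print((0, 0) in z_in_uc)   # False: with both locations UC, one load must see 1
```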

3a)

   thread 0             |     thread 1
 store [x_in_UC] <-- 1  |   load [y_in_WC]
 store [y_in_WC] <-- 1  |   load [x_in_UC]

Same as "1a". It's not possible to load 1 from y and 0 from x.

3b)

   thread 0             |     thread 1
 store [x_in_WC] <-- 1  |   load [y_in_UC]
 store [y_in_UC] <-- 1  |   load [x_in_WC]

Same as "1b". It's very possible to load 1 from y and 0 from x.

3c)

   thread 0             |     thread 1
 store [x_in_WC] <-- 1  |   load [y_in_UC]
 sfence                 |   load [x_in_WC]
 store [y_in_UC] <-- 1  |

The sfence forces thread 0 to wait for the write combining buffer to drain, so it's not possible to load 1 from y and 0 from x.
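The difference between "3b" and "3c" can be illustrated with a small brute-force sketch. It rests on the simplified picture used in this answer, and its event names (`store_wc`, `store_uc`, `drain`) are made up for the example: a WC store sits in a buffer until either a spontaneous `drain` or an `sfence` pushes it to memory, and loads read memory directly.

```python
def interleavings(*threads):
    """Yield every merge of the threads that keeps each thread's program order."""
    if not any(threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + (t[1:],) + threads[i + 1:]
            for tail in interleavings(*rest):
                yield [t[0]] + tail


def run(order):
    """Execute one interleaving; return (value loaded from y, value loaded from x)."""
    memory = {"x": 0, "y": 0}
    wc_buffer = {}
    loaded = {}
    for op, addr in order:
        if op == "store_wc":
            wc_buffer[addr] = 1          # waits in the write-combining buffer
        elif op == "store_uc":
            memory[addr] = 1             # goes straight to memory
        elif op == "load":
            loaded[addr] = memory[addr]
        elif op in ("sfence", "drain"):  # both empty the buffer in this model
            memory.update(wc_buffer)
            wc_buffer.clear()
    return loaded["y"], loaded["x"]


T1 = [("load", "y"), ("load", "x")]
DRAIN = [("drain", None)]                # buffer may also drain on its own
T0_3B = [("store_wc", "x"), ("store_uc", "y")]
T0_3C = [("store_wc", "x"), ("sfence", None), ("store_uc", "y")]

without_sfence = {run(o) for o in interleavings(T0_3B, T1, DRAIN)}
with_sfence = {run(o) for o in interleavings(T0_3C, T1, DRAIN)}
print((1, 0) in without_sfence)   # True: y == 1 and x == 0 is reachable
print((1, 0) in with_sfence)      # False: the sfence drains x before y is stored
```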

4a)

   thread 0             |     thread 1
 store [x_in_WB] <-- 1  |   load [y_in_WC]
 store [y_in_WC] <-- 1  |   load [x_in_WB]

Mostly the same as "1a" and "3a". The only difference is that the store to x_in_WB goes to caches (and the load from x_in_WB comes from caches). Note: an external observer (e.g. a device watching the physical address space) may not see the store to x_in_WB for ages.

4b)

   thread 0             |     thread 1
 store [x_in_WC] <-- 1  |   load [y_in_WB]
 store [y_in_WB] <-- 1  |   load [x_in_WC]

Mostly the same as "1b" and "3b". Note: an external observer (e.g. a device watching the physical address space) may not see the store to y_in_WB for ages.
