
I have been reading Memory Barriers: A Hardware View For Software Hackers, a very popular article by Paul E. McKenney.

One of the things the paper highlights is that very weakly ordered processors, such as Alpha, can reorder dependent loads, which appears to be a side effect of a partitioned cache.

Snippet from the paper:

1 struct el *insert(long key, long data)
2 {
3     struct el *p;
4     p = kmalloc(sizeof(*p), GFP_ATOMIC);
5     spin_lock(&mutex);
6     p->next = head.next;
7     p->key = key;
8     p->data = data; 
9     smp_wmb();
10    head.next = p;
11    spin_unlock(&mutex);
12 }
13
14 struct el *search(long key)
15 {
16     struct el *p;
17     p = head.next;
18     while (p != &head) {
19         /* BUG ON ALPHA!!! */
20         if (p->key == key) {
21             return (p);
22         }
23         p = p->next;
24     };
25     return (NULL);
26 }
  1. There are two processors, CPU0 and CPU1.
  2. Each CPU has two cache banks: CB0 (odd addresses) and CB1 (even addresses).
  3. head is in CB0 and p is in CB1.
  4. insert() has a write barrier, which ensures that the invalidations for lines 6-8 are on the bus before the invalidation for line 10.
  5. However, on the other processor executing search(), CB0 can be lightly loaded and CB1 heavily loaded.
  6. This means the processor reads the latest value of head but an old value of p (because the invalidation request for p has not been processed by CB1 yet).
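The fix the paper's smp_wmb()/smp_read_barrier_depends() pair expresses can also be written in portable C11 atomics. This is a hedged sketch of the same publish/traverse pattern (the names node, publish, and list_head are mine, not from the paper); the acquire load on the reader side is what closes the Alpha hole:

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    long key, data;
    struct node *next;
};

static _Atomic(struct node *) list_head = NULL;

/* Writer: initialize the fields, then publish the pointer with
 * release ordering (the role played by smp_wmb() plus the store
 * to head.next in the paper's insert()). */
void publish(long key, long data)
{
    struct node *p = malloc(sizeof *p);
    p->key = key;
    p->data = data;
    p->next = atomic_load_explicit(&list_head, memory_order_relaxed);
    atomic_store_explicit(&list_head, p, memory_order_release);
}

/* Reader: the acquire load pairs with the release store above, so
 * the dereferences below are guaranteed to see the initialized
 * fields even on Alpha.  memory_order_consume would express the
 * data dependency more precisely, but compilers currently promote
 * it to acquire anyway. */
struct node *search(long key)
{
    struct node *p = atomic_load_explicit(&list_head, memory_order_acquire);
    while (p != NULL) {
        if (p->key == key)
            return p;
        p = p->next;
    }
    return NULL;
}
```

On every architecture except Alpha, the acquire on the reader side compiles to nothing extra beyond an ordinary load, because the address dependency already orders the loads in hardware.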

Question: It looks like all architectures except Alpha honor dependent loads. For example, IA64 can perform all of the following reorderings except dependent-load reordering:

  1. Loads reordered after loads
  2. Loads reordered after stores
  3. Stores reordered after stores
  4. Stores reordered after loads
  5. Atomic instructions reordered with loads
  6. Atomic instructions reordered with stores
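Each of these permitted reorderings can be suppressed in portable C11 with a fence of matching strength. A sketch (not IA64-specific; the names writer, reader, data_word, and flag are illustrative):

```c
#include <stdatomic.h>

atomic_int data_word = 0;
atomic_int flag = 0;

/* Writer: the release fence orders the data store before the flag
 * store, forbidding "stores reordered after stores" across it. */
void writer(void)
{
    atomic_store_explicit(&data_word, 42, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

/* Reader: the acquire fence orders the flag load before the data
 * load, forbidding "loads reordered after loads".  If the flag is
 * observed set, the data load is guaranteed to see 42. */
int reader(void)
{
    if (atomic_load_explicit(&flag, memory_order_relaxed) == 1) {
        atomic_thread_fence(memory_order_acquire);
        return atomic_load_explicit(&data_word, memory_order_relaxed);
    }
    return -1; /* flag not yet observed */
}
```

The point of the dependent-load case is that on every architecture except Alpha, the reader's fence is unnecessary when the second load's address is computed from the first load's result.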

This makes me wonder what hardware support is required to prevent dependent load reordering.

One possible answer is that the other architectures (e.g., IA64) do not have a partitioned cache and hence would not run into this issue, so no explicit hardware support is required.

Any insights?

I know I've seen an interesting mailing-list archive where Linus Torvalds was saying that only a few models of real Alpha hardware could actually reorder dependent loads, so the (costly) memory barriers needed all over the place felt like even more of a burden. He was also saying that out-of-order CPUs need to track dependencies anyway to give correct single-thread behaviour, so the extra burden of providing stronger memory-ordering semantics for SMP ranges from negligible to small. I haven't found it yet :/ — Peter Cordes
I did find this article while looking: linuxjournal.com/node/8211/print. It's by the same author as the paper you linked, but I haven't done more than glance at your link yet, so I don't know how much overlap there is. — Peter Cordes
To answer your short question about what hardware is required to prevent dependent load reordering, the answer is that the load needs to be pegged to the cache line, so that if the cache line is invalidated (due to a write from another core), the load is repeated. This is how x86 does it. — David Schwartz
Could you elaborate on this in detail, with an example? Links would help too. — KodeWarrior
You might find this discussion interesting: Linus argues that hardware with fast memory barriers already needs to track a lot of stuff, so it might as well go all the way and make things much easier for software by being like x86, with implicit barriers between every memory op. It's a big thread with many good posts. Linus also discusses dependent-load reordering on Alpha and explains the many errors in a custom lock implementation. — Peter Cordes

1 Answer


Short answer:

In an out-of-order processor the load-store queue is used to track and enforce memory ordering constraints. Processors such as the Alpha 21264 have the necessary hardware to prevent dependent load reordering, but enforcing this dependency could add overhead for inter-processor communication.

Long answer:

Background on dependence tracking

This is probably best explained using an example. Imagine that you had the following sequence of instructions (pseudo-code instructions used for simplicity):

ST R1, A       // store value in register R1 to memory at address A
LD B, R2       // load value from memory at address B to register R2
ADD R2, 1, R2  // add immediate value 1 to R2 and save result in R2

In this example there is a dependency between the LD and the ADD instruction. The ADD reads the value of R2 and so it cannot execute until the LD makes that value available. This dependency is through a register and it is something that the processor's issue logic can track.

However, there could also be a dependency between the ST and the LD, if address A and B were the same. But unlike the dependence between the LD and the ADD, the possible dependence between the ST and the LD is not known at the time the instruction is issued (begins execution).

Instead of trying to detect memory dependencies at issue time, the processor keeps track of them using a structure called the load-store queue. What this structure does is keep track of the addresses of pending loads and stores for instructions that have been issued but not yet retired. If there is a memory ordering violation this can be detected and execution can be restarted from the point where the violation occurred.

So going back to the pseudo-code example, you could imagine a situation where the LD is executed before the ST (perhaps the value needed in R1 wasn't ready for some reason). But when the ST executes it sees that address A and B are the same. So the LD should really have read the value that was produced by the ST, rather than the stale value that was already in the cache. As a result the LD will need to be re-executed, along with any instructions that came after the LD. There are various optimizations possible to reduce some of this overhead, but the basic idea holds.
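At the source level, the guarantee the load-store queue enforces is just single-thread correctness: when the store and load addresses alias, the load must observe the store's value, no matter how the hardware speculated. A minimal C rendering of the pseudo-code above (run_sequence is an invented name; A and B deliberately alias the same location):

```c
/* Single-threaded illustration of the memory dependence discussed
 * above.  A and B alias the same address, so the load must observe
 * the store's value -- the load-store queue replays the load if it
 * was speculatively executed before the store resolved. */
long run_sequence(void)
{
    long mem;
    long *A = &mem;
    long *B = &mem;   /* same address as A: the dependent case */
    long R1 = 41, R2;

    *A = R1;          /* ST R1, A */
    R2 = *B;          /* LD B, R2 -- must see 41, not a stale value */
    R2 = R2 + 1;      /* ADD R2, 1, R2 */
    return R2;        /* 42 */
}
```

The compiler and CPU are both free to reorder aggressively here, as long as the final value is as-if executed in program order; that "as-if" machinery is exactly the dependence tracking the answer describes.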

As I mentioned earlier the logic to detect this dependence exists in all out-of-order processors that allow speculative execution of memory instructions (including Alpha processors).

Memory ordering rules

However, memory ordering rules don't just constrain the order in which a processor sees the results of its own memory operations. They also constrain the relative order in which memory operations performed on one processor become visible to other processors.

Alpha example

In the case of dependent load reordering, the processor has to track this information for its own use, but the Alpha ISA does not require it to make sure that other processors see this ordering. One example of how this can occur is the following (quoted from this link):

Initially: p = & x, x = 1, y = 0

    Thread 1         Thread 2
--------------------------------
  y = 1         |    
  memoryBarrier |    i = *p
  p = & y       |
--------------------------------
Can result in: i = 0

The anomalous behavior is currently only possible on a 21264-based system. And obviously you have to be using one of our multiprocessor servers. Finally, the chances that you actually see it are very low, yet it is possible.

Here is what has to happen for this behavior to show up. Assume T1 runs on P1 and T2 on P2. P2 has to be caching location y with value 0. P1 does y=1, which causes an "invalidate y" to be sent to P2. This invalidate goes into the incoming "probe queue" of P2; as you will see, the problem arises because this invalidate could theoretically sit in the probe queue without doing an MB on P2. The invalidate is acknowledged right away at this point (i.e., you don't wait for it to actually invalidate the copy in P2's cache before sending the acknowledgment). Therefore, P1 can go through its MB. And it proceeds to do the write to p. Now P2 proceeds to read p. The reply for read p is allowed to bypass the probe queue on P2 on its incoming path (this allows replies/data to get back to the 21264 quickly without needing to wait for previous incoming probes to be serviced). Now, P2 can dereference p to read the old value of y that is sitting in its cache (the inval y in P2's probe queue is still sitting there).

How does an MB on P2 fix this? The 21264 flushes its incoming probe queue (i.e., services any pending messages in there) at every MB. Hence, after the read of p, you do an MB, which pulls in the inval to y for sure. And you can no longer see the old cached value for y.

Even though the above scenario is theoretically possible, the chances of observing a problem due to it are extremely minute. The reason is that even if you setup the caching properly, P2 will likely have ample opportunity to service the messages (i.e., inval) in its probe queue before it receives the data reply for "read p". Nonetheless, if you get into a situation where you have placed many things in P2's probe queue ahead of the inval to y, then it is possible that the reply to p comes back and bypasses this inval. It would be difficult for you to set up the scenario though and actually observe the anomaly.

The above addresses how current Alphas may violate what you have shown. Future Alphas could violate it due to other optimizations. One interesting optimization is value prediction.
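The litmus test quoted above can be written in portable C11. With a relaxed load of p on the reader side, i = 0 is permitted (and observable on a 21264); an acquire (or consume) load forbids it. A hedged sketch with invented names (thread1, thread2):

```c
#include <stdatomic.h>

/* Initially: p = &x, x = 1, y = 0 */
static int x = 1, y = 0;
static _Atomic(int *) p = &x;

void thread1(void)
{
    y = 1;
    /* The memoryBarrier in the litmus test: release ordering makes
     * the store to y visible before the store to p. */
    atomic_store_explicit(&p, &y, memory_order_release);
}

int thread2(void)
{
    /* With memory_order_relaxed here, the result i = 0 would be
     * permitted; acquire (or consume) pairs with the release store
     * and forbids it. */
    int *q = atomic_load_explicit(&p, memory_order_acquire);
    return *q;   /* i */
}
```

On non-Alpha hardware the acquire here is free on the dependent load; Alpha is the reason the C11 committee added memory_order_consume in the first place.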

Summary

The basic hardware needed to enforce the ordering of dependent loads is already present in all out-of-order processors. But ensuring that this memory ordering is seen by all processors adds additional constraints to the handling of cache-line invalidations, and it may add constraints in other scenarios as well. In practice, however, it seems likely that the potential advantages of the weak Alpha memory model for hardware designers were not worth the cost in software complexity and the added overhead of requiring more memory barriers.