Why do we require two memory barriers in a postbox data communication between two cores?

Question

Here is postbox code for data communication between two ARM cores (taken directly from the ARM Cortex-A Series Programmer's Guide).

Core A:

STR R0, [Msg] @ write some new data into postbox
STR R1, [Flag] @ new data is ready to read

Core B:

Poll_loop:
LDR R1, [Flag]
CMP R1,#0 @ is the flag set yet?
BEQ Poll_loop
LDR R0, [Msg] @ read new data.

To enforce the ordering, the document says that we need to insert not one but two memory barriers (DMB) into the code.

Core A:

STR R0, [Msg] @ write some new data into postbox
DMB
STR R1, [Flag] @ new data is ready to read

Core B:

Poll_loop:
LDR R1, [Flag]
CMP R1,#0 @ is the flag set yet?
BEQ Poll_loop
DMB
LDR R0, [Msg] @ read new data.

I understand the first DMB on Core A: it prevents compiler reordering and also ensures that the memory access to [Msg] is observed by the system. Below is the definition of DMB from the same document.

Data Memory Barrier (DMB)
This instruction ensures that all memory accesses in program order before the barrier are observed in the system before any explicit memory accesses that appear in program order after the barrier. It does not affect the ordering of any other instructions executing on the core, or of instruction fetches.

However, I am not sure why the DMB on Core B is needed. The document says:

Core B requires a DMB before the LDR R0, [Msg] to be sure that the message is not read until the flag is set.

If the DMB on Core A makes the store to [Msg] observed by the system, then we should not need the DMB on the second core. My guess is that the compiler might reorder the reads of [Flag] and [Msg] on Core B (though I do not understand why it would, since the read of [Msg] depends on [Flag]).

If this is the case, a compiler barrier (asm volatile("" ::: "memory")) instead of a DMB should be enough. Am I missing something here?

Don't know the memory model, but maybe core 2 is allowed to see writes in a different order from the one in which they occurred? Some Cortex-A's are out-of-order, so the load of the message could have run speculatively based on the branch predictor's output, independent of when the flag was fetched from memory. — twotwotwo

Notlikethat · Accepted Answer · 2016-02-07T01:26:02

Both barriers are necessary, and both do need to be DMBs - this is about the hardware memory model, and has nothing to do with compiler reordering.

Let's look at the writer on core A first:

STR R0, [Msg] @ write some new data into postbox
STR R1, [Flag] @ new data is ready to read

Since these are two independent stores to different addresses with no dependency between them, there is nothing to force core A to actually issue the stores in program order. The store to Msg could, say, linger in a part-filled write buffer whilst the store to Flag overtakes it and goes straight out to the memory system. Thus any observer other than core A could see the new value of Flag, without yet seeing the new value of Msg.

STR R0, [Msg] @ write some new data into postbox
DMB
STR R1, [Flag] @ new data is ready to read

Now, with the barrier, the store to Flag is not permitted to be visible before the store to Msg, because that would necessitate one or other store appearing to cross the barrier. Thus any external observer may either see both old values, the new Msg but the old Flag, or both new values. The previous case of seeing the new Flag but the old Msg can no longer occur.
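For anyone writing this in C rather than assembly, here is a minimal sketch of the writer side using C11 atomics. The `postbox_t` type and the `postbox_send` name are made up for illustration and are not from the ARM guide. On ARMv7, GCC and Clang typically lower the release fence to a `dmb ish`, but that is worth checking in the generated code for your own target.

```c
#include <stdatomic.h>

/* Hypothetical postbox shared between the two cores (illustrative names). */
typedef struct {
    atomic_int msg;   /* payload, plays the role of [Msg]            */
    atomic_int flag;  /* nonzero once msg is valid, plays [Flag]     */
} postbox_t;

/* Writer side (core A): publish the message, then raise the flag. */
static void postbox_send(postbox_t *pb, int value)
{
    /* STR R0, [Msg] */
    atomic_store_explicit(&pb->msg, value, memory_order_relaxed);

    /* DMB: the store above may not be observed after the store below. */
    atomic_thread_fence(memory_order_release);

    /* STR R1, [Flag] */
    atomic_store_explicit(&pb->flag, 1, memory_order_relaxed);
}
```

The two stores are deliberately relaxed so that the fence is the only thing ordering them, mirroring the assembly. A release store to `flag` would be the more common way to write this in practice.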

OK, so the first barrier handles things getting written in the correct order, but there's also the matter of how they are read. Over on core B...

Poll_loop:
LDR R1, [Flag]
CMP R1,#0 @ is the flag set yet?
BEQ Poll_loop
LDR R0, [Msg] @ read new data.

Note that the branch to Poll_loop does not form a control dependency between the two loads; if you consider program order, the load of Msg is unconditional, and the value of Flag does not affect whether it is executed or not, only whether execution ever progresses to that part of the program at all. Therefore the code could equivalently be written thus:

Poll_loop:
LDR R1, [Flag]
LDR R0, [Msg] @ read data, just in case.
CMP R1,#0 @ is the flag set yet?
BEQ Poll_loop @ no? OK, throw away that data and read everything again.
... @ do stuff with R0, because Flag was set so it must be good data, right?

Start to see the problem? Even with the original code, core B is free to speculatively load Msg as soon as it reaches Poll_loop, so even if the stores from core A become visible in program order, things could still play out like this:

  core A   |  core B
-----------+-----------
           | load Msg
store Msg  |
store Flag |
           | load Flag
           | conclude that old Msg is valid

Thus you either need a barrier:

...
BEQ Poll_loop
DMB
LDR R0, [Msg] @ read new data.

or perhaps a fake address dependency:

...
BEQ Poll_loop
EOR R1, R1, R1
LDR R0, [Msg, R1] @ read new data.

to order the two loads against each other.
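As a rough C11 counterpart of the complete exchange, the sketch below pairs a release fence on the writer with an acquire fence on the reader. All names here (`postbox_t`, `postbox_send`, `postbox_try_receive`) are hypothetical, and the comments showing the corresponding instructions describe the intent, not guaranteed compiler output.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical postbox shared between the two cores (illustrative names). */
typedef struct {
    atomic_int msg;   /* payload, plays the role of [Msg]        */
    atomic_int flag;  /* nonzero once msg is valid, plays [Flag] */
} postbox_t;

/* Writer side (core A). */
static void postbox_send(postbox_t *pb, int value)
{
    atomic_store_explicit(&pb->msg, value, memory_order_relaxed); /* STR [Msg]  */
    atomic_thread_fence(memory_order_release);                    /* DMB        */
    atomic_store_explicit(&pb->flag, 1, memory_order_relaxed);    /* STR [Flag] */
}

/* Reader side (core B): one pass of the poll loop.
 * Returns false if the flag is still clear (the caller should retry),
 * otherwise stores the message in *out and returns true. */
static bool postbox_try_receive(postbox_t *pb, int *out)
{
    /* LDR R1, [Flag]; CMP R1, #0; BEQ Poll_loop */
    if (atomic_load_explicit(&pb->flag, memory_order_relaxed) == 0)
        return false;

    /* DMB: the load of msg below may not be satisfied before the
     * load of flag above, so a stale msg cannot be paired with a set flag. */
    atomic_thread_fence(memory_order_acquire);

    /* LDR R0, [Msg] */
    *out = atomic_load_explicit(&pb->msg, memory_order_relaxed);
    return true;
}
```

In real use one thread would call `postbox_send` while another spins on `postbox_try_receive` until it returns true. Dropping either fence reintroduces the corresponding reordering described above.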
