6 votes

Assume there are two threads, running on x86 CPU0 and CPU1 respectively. The thread running on CPU0 executes the following commands:

A=1
B=1

The cache line containing A is initially owned by CPU1, and the one containing B is owned by CPU0.

I have two questions:

  1. If I understand correctly, both stores will be put into the CPU's store buffer. However, for the first store A=1 the cache line must first be invalidated in CPU1, while the second store B=1 can be committed to the cache immediately, since CPU0 already owns the cache line containing it. I know that x86 CPUs respect store order. Does that mean that B=1 will not be written to the cache before A=1?

  2. Assume that the following commands are executed on CPU1:

while (B == 0);
print A

Is it enough to add only an lfence between the while and the print in CPU1, without adding an sfence between A=1 and B=1 in CPU0, to always get 1 printed on x86?

while (B == 0);
lfence
print A
Even if x86 guarantees it, why take the risk? Why not just use the right barriers? - Zan Lynx
Zan, it can be an advantage in many places if the CPU guarantees that. For example, spinlocks are implemented in the kernel without using any lock prefixes because they can afford it. And a fence is not the solution to this question; otherwise, one needs to use a proper lock. - Saurabh
The answer to the first question: yes. The answer to the second question: yes, but only in assembler (not in C/C++). As rightly said, LFENCE is not needed here on x86; it provides acquire consistency automatically. Note that the x86 CPU can't reorder a load with later memory operations, but the C/C++ compiler can. In C++ you should use acquire consistency: extern std::atomic<int> B; while( B.load(std::memory_order_acquire) == 0 ); std::cout << A; See en.cppreference.com/w/cpp/atomic/memory_order - Alex
Unless B is marked volatile, a compiler may be allowed to convert while (B == 0) to while (true), because as far as the compiler can see, nothing changes the value of B within that loop. For example, C/C++ compilers are allowed to do this at high optimization levels. - Mikko Rantalainen
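
To make the comments concrete, here is a minimal C++11 sketch of the pattern Alex and Mikko describe (the declarations are my assumptions, since the question only gives pseudocode): B is a std::atomic, so the compiler must actually reload it in the loop, and the release/acquire pair gives the ordering the question asks about.

#include <atomic>
#include <cstdio>

std::atomic<int> B{0};   // flag; atomic, so the compiler must reload it each iteration
int A = 0;               // ordinary data, published before the flag is set

void writer()            // runs on CPU0
{
    A = 1;                                      // plain store
    B.store(1, std::memory_order_release);      // release: A=1 becomes visible first
}

void reader()            // runs on CPU1
{
    while (B.load(std::memory_order_acquire) == 0)
        ;                                       // spin; acquire pairs with the release
    std::printf("%d\n", A);                     // always prints 1
}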

1 Answer

8 votes

On x86, writes by a single processor are observed in the same order by all processors. There is no need for a fence in your example, nor in any normal program on x86. Your program:

while(B==0);  // wait for B == 1 to become globally observable
print A;      // now, A will always be 1 here

What exactly happens in the cache is model-specific. All kinds of tricks and speculative behavior can occur in the cache, but the observable behavior always follows the architectural rules.

See the Intel Software Developer's Manual, Volume 3 (System Programming Guide), Section 8.2.2, for the details on memory ordering.
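
To illustrate the answer, here is a small litmus-test harness (my own sketch, not part of the answer) built on the acquire/release form sketched above. On x86 the release store and the acquire load each compile to an ordinary MOV, with no fence instruction, yet the reader never observes the stale value of A:

#include <atomic>
#include <cstdio>
#include <thread>

int main()
{
    for (int i = 0; i < 100000; ++i) {
        std::atomic<int> B{0};
        int A = 0;

        std::thread writer([&] {
            A = 1;                                  // plain store; on x86 just a MOV
            B.store(1, std::memory_order_release);  // on x86 also just a MOV
        });

        std::thread reader([&] {
            while (B.load(std::memory_order_acquire) == 0)
                ;                                   // spin until B == 1
            if (A != 1)
                std::printf("reordering observed at iteration %d\n", i);
        });

        writer.join();
        reader.join();
    }
    std::printf("done: A was 1 every time B was seen as 1\n");
    return 0;
}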