Acquire/release semantics with non-temporal stores on x64

Question

I have something like:

if (f = acquire_load() == ) {
   ... use Foo
}

and:

auto f = new Foo();
release_store(f)

You could easily imagine an implementation of acquire_load and release_store that uses atomic with load(memory_order_acquire) and store(memory_order_release). But now what if release_store is implemented with _mm_stream_si64, a non-temporal write, which is not ordered with respect to other stores on x64? How to get the same semantics?

I think the following is the minimum required:

atomic<Foo*> gFoo;

Foo* acquire_load() {
    return gFoo.load(memory_order_relaxed);
}

void release_store(Foo* f) {
   _mm_stream_si64(*(Foo**)&gFoo, f);
}

And use it as so:

// thread 1
if (f = acquire_load() == ) {
   _mm_lfence(); 
   ... use Foo
}

and:

// thread 2
auto f = new Foo();
_mm_sfence(); // ensures Foo is constructed by the time f is published to gFoo
release_store(f)

Is that correct? I'm pretty sure the sfence is absolutely required here. But what about the lfence? Is it required or would a simple compiler barrier be enough for x64? e.g. asm volatile("": : :"memory"). According to the x86 memory model, loads are not re-ordered with other loads. So to my understanding, acquire_load() must happen before any load inside the if statement, as long as there's a compiler barrier.

Btw, none of the SIMD load/stores (even when aligned) guarantee atomicity. — Mysticial
@Mysticial _mm_stream_si64 generates the movnti instruction, which while being sse2, is a 64bit store. There's a 32bit one as well. They must guarantee atomicity - there's almost no sane CPU architecture for a 64bit CPU where they wouldn't. — Eloff
Thanks @EOF. Maybe I'm reading ensures Foo is constructed by the time f is visible incorrectly. Either way, a barrier should be enough, IMO. AFAIK fences are needed only for cross-cpu effects. Writes from a single will keep order. — BitWhistler
I believe the lfence is required only if your "use f..." is using nt loads. @EOF, I believe that to be sane, library functions that use nt stores wash them down with sfence. You can verify with the impl of a modern memset — BitWhistler
@jrh: C11 has an equivalent stdatomic with the same memory_order_acquire and so on. The actual syntax isn't relevant to what the OP is asking about, which is how x86 NT stores interact with C11 / C++11 memory ordering semantics. You could write equivalent code in C, just with different syntax. Still, there are only room for 5 tags, and stdatomic is probably more important to have than c. — Peter Cordes

Peter Cordes · Accepted Answer · 2016-02-23T07:33:45

I might be wrong about some things in this answer (proof-reading welcome from people that know this stuff!). It's based on reading the docs and Jeff Preshing's blog, not actual recent experience or testing.

Linus Torvalds strongly recommends against trying to invent your own locking, because it's so easy to get it wrong. It's more of an issue when writing portable code for the Linux kernel, rather than something that's x86-only, so I feel brave enough to try to sort things out for x86.

The normal way to use NT stores is to do a bunch of them in a row, like as part of a memset or memcpy, then an SFENCE, then a normal release store to a shared flag variable: done_flag.store(1, std::memory_order_release).
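A minimal sketch of that usual pattern, assuming x86-64 and a compiler that provides `<immintrin.h>`. The names `kCount`, `g_buffer`, `g_done`, `producer_fill` and `consumer_try_sum` are made up for illustration and are not from the question:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <immintrin.h>  // _mm_stream_si32, _mm_sfence

// Hypothetical names, for illustration only.
constexpr std::size_t kCount = 1024;
alignas(64) static int g_buffer[kCount];   // bulk data written with NT stores
static std::atomic<int> g_done{0};         // flag written with a normal store

void producer_fill(int value) {
    for (std::size_t i = 0; i < kCount; ++i) {
        _mm_stream_si32(&g_buffer[i], value);    // movnti: weakly-ordered NT store
    }
    _mm_sfence();                                // order the NT stores before the flag store
    g_done.store(1, std::memory_order_release);  // normal store, a plain mov on x86
}

// Returns false until the flag is set; otherwise sums the buffer into `sum`.
bool consumer_try_sum(long long& sum) {
    if (g_done.load(std::memory_order_acquire) == 0) {
        return false;
    }
    sum = 0;
    for (std::size_t i = 0; i < kCount; ++i) {
        sum += g_buffer[i];
    }
    return true;
}
```

Only the bulk data is written with NT stores here; the flag itself stays a normal cached store, so a consumer polling it does not have to go out to DRAM.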

Using a movnti store to the synchronization variable will hurt performance. You might want to use NT stores into the Foo it points to, but evicting the pointer itself from cache is perverse. (movnt stores evict the cache line if it was in cache to start with; see vol1 ch 10.4.6.2 Caching of Temporal vs. Non-Temporal Data).

The whole point of NT stores is for use with Non-Temporal data, which won't be used again (by any thread) for a long time if ever. The locks that control access to shared buffers, or the flags that producers/consumers use to mark data as read, are expected to be read by other cores.

Your function names also don't really reflect what you're doing.

x86 hardware is extremely heavily optimized for doing normal (not NT) release-stores, because every normal store is a release-store. The hardware has to be good at it for x86 to run fast.
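For comparison, here is a sketch of the plain std::atomic version of the question's two functions, using only normal stores and loads. `Foo` is a stand-in type with a single hypothetical member:

```cpp
#include <atomic>
#include <cassert>

struct Foo { int value; };          // stand-in for the question's Foo

std::atomic<Foo*> gFoo{nullptr};

// On x86 this is an ordinary mov store; the memory_order argument mostly
// constrains the compiler, and requests real barriers on weaker ISAs.
void release_store(Foo* f) {
    gFoo.store(f, std::memory_order_release);
}

// On x86 this is an ordinary mov load.
Foo* acquire_load() {
    return gFoo.load(std::memory_order_acquire);
}
```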

Using normal stores/loads only requires a trip to L3 cache, not to DRAM, for communication between threads on Intel CPUs. Intel's large inclusive L3 cache works as a backstop for cache-coherency traffic. Probing the L3 tags on a miss from one core will detect the fact that another core has the cache line in the Modified or Exclusive state. NT stores would require synchronization variables to go all the way out to DRAM and back for another core to see it.

Memory ordering for NT streaming stores

movnt stores can be reordered with other stores, but not with older reads.

Intel's x86 manual vol3, chapter 8.2.2 (Memory Ordering in P6 and More Recent Processor Families):

Reads are not reordered with other reads.

Writes are not reordered with older reads. (note the lack of exceptions).

Writes to memory are not reordered with other writes, with the following exceptions:
streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and

string operations (see Section 8.2.4.1). (note: From my reading of the docs, fast string and ERMSB ops still implicitly have a StoreStore barrier at the start/end. There's only potential reordering between the stores within a single rep movs or rep stos.)

... stuff about clflushopt and the fence instructions

update: There's also a note (in 8.1.2.2 Software Controlled Bus Locking) that says:

Do not implement semaphores using the WC memory type. Do not perform non-temporal stores to a cache line containing a location used to implement a semaphore.

This may just be a performance suggestion; they don't explain whether it can cause a correctness problem. Note that NT stores are not cache-coherent, though (data can sit in the line fill buffer even if conflicting data for the same line is present somewhere else in the system, or in memory). Maybe you could safely use NT stores as a release-store that synchronizes with regular loads, but would run into problems with atomic RMW ops like lock add dword [mem], 1.
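Whether or not that note is about correctness, it is easy to follow by keeping the synchronization variable on its own cache line, away from any data written with NT stores. A sketch, assuming 64-byte cache lines; `Channel`, `kCacheLine` and `shares_line` are hypothetical names:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumption: typical x86-64 line size

struct Channel {
    // Touched only by normal loads/stores and locked RMW operations.
    alignas(kCacheLine) std::atomic<std::uint32_t> ready{0};
    // Bulk data; NT stores are confined to this region.
    alignas(kCacheLine) std::uint8_t payload[4096];
};

// True if the two addresses fall in the same cache line.
bool shares_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}
```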

Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order.

To block reordering with earlier stores, we need an SFENCE instruction, which is a StoreStore barrier even for NT stores. (And is also a barrier to some kinds of compile-time reordering, but I'm not sure if it blocks earlier loads from crossing the barrier.) Normal stores don't need any kind of barrier instruction to be release-stores, so you only need SFENCE when using NT stores.

For loads: The x86 memory model for WB (write-back, i.e. "normal") memory already prevents LoadStore reordering even for weakly-ordered stores, so we don't need an LFENCE for its LoadStore barrier effect, only a LoadStore compiler barrier before the NT store. In gcc's implementation at least, std::atomic_signal_fence(std::memory_order_release) is a compiler-barrier even for non-atomic loads/stores, but atomic_thread_fence is only a barrier for atomic<> loads/stores (including mo_relaxed). Using an atomic_thread_fence still allows the compiler more freedom to reorder loads/stores to non-shared variables. See this Q&A for more.

// The function can't be called release_store unless it actually is one (i.e. includes all necessary barriers)
// Your original function should be called relaxed_store
#include <atomic>       // std::atomic_thread_fence
#include <cstdint>      // int64_t
#include <immintrin.h>  // _mm_sfence, _mm_stream_si64
// gFoo is the atomic<Foo*> declared in the question
void NT_release_store(const Foo* f) {
   // _mm_lfence();  // make sure all reads from the locked region are already globally visible.  Not needed: this is already guaranteed
   std::atomic_thread_fence(std::memory_order_release);  // no insns emitted on x86 (since it assumes no NT stores), but still a compiler barrier for earlier atomic<> ops
   _mm_sfence();  // make sure all writes to the locked region are already globally visible, and don't reorder with the NT store
   _mm_stream_si64((long long int*)&gFoo, (int64_t)f);
}

This stores to the atomic variable (note the lack of dereferencing &gFoo). Your function stores to the Foo it points to, which is super weird; IDK what the point of that was. Also note that it compiles as valid C++11 code.

When thinking about what a release-store means, think about it as the store that releases the lock on a shared data structure. In your case, when the release-store becomes globally visible, any thread that sees it should be able to safely dereference it.

To do an acquire-load, just tell the compiler you want one.

x86 doesn't need any barrier instructions, but specifying mo_acquire instead of mo_relaxed gives you the necessary compiler-barrier. As a bonus, this function is portable: you'll get any and all necessary barriers on other architectures:

Foo* acquire_load() {
    return gFoo.load(std::memory_order_acquire);
}
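Putting the two halves together, a self-contained sketch might look like the following. It assumes x86-64 (64-bit pointers and a lock-free atomic pointer), and like the function above it writes through a cast of &gFoo, which the C++ standard does not bless even though it works on mainstream x86-64 compilers. `Foo` is a stand-in type:

```cpp
#include <atomic>
#include <cassert>
#include <immintrin.h>  // _mm_sfence, _mm_stream_si64

struct Foo { int value; };  // stand-in for the question's Foo

std::atomic<Foo*> gFoo{nullptr};

// Assumption of this sketch: the atomic pointer is a bare 64-bit word.
static_assert(sizeof(std::atomic<Foo*>) == sizeof(long long),
              "sketch assumes a 64-bit atomic pointer");

void NT_release_store(Foo* f) {
    std::atomic_thread_fence(std::memory_order_release);  // compiler barrier, no instructions on x86
    _mm_sfence();                                         // earlier stores become visible before the NT store
    _mm_stream_si64(reinterpret_cast<long long*>(&gFoo),
                    reinterpret_cast<long long>(f));      // movnti to the pointer itself
}

Foo* acquire_load() {
    return gFoo.load(std::memory_order_acquire);          // plain mov on x86
}
```

A reader thread would call acquire_load() and dereference the result only if it is non-null. As discussed above, a normal release store to gFoo is likely the better choice in practice.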

You didn't say anything about storing gFoo in weakly-ordered WC (uncacheable write-combining) memory. It's probably really hard to arrange for your program's data segment to be mapped into WC memory... It would be a lot easier for gFoo to simply point to WC memory, after you mmap some WC video RAM or something. But if you want acquire-loads from WC memory, you probably do need LFENCE. IDK. Ask another question about that, because this answer mostly assumes you're using WB memory.

Note that using a pointer instead of a flag creates a data dependency. I think you should be able to use gFoo.load(std::memory_order_consume), which doesn't require barriers even on weakly-ordered CPUs (other than Alpha). Once compilers are sufficiently advanced to make sure they don't break the data dependency, they can actually make better code (instead of promoting mo_consume to mo_acquire). Read up on this before using mo_consume in production code, and especially note that testing it properly is impossible, because future compilers are expected to give weaker guarantees than current compilers do in practice.
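A sketch of what the consume version would look like, with the caveats above. `Foo` and `read_value_or` are hypothetical, and newer compilers may warn about memory_order_consume:

```cpp
#include <atomic>
#include <cassert>

struct Foo { int value; };  // stand-in for the question's Foo

std::atomic<Foo*> gFoo{nullptr};

// The read of f->value depends on the loaded pointer, which is the data
// dependency that consume ordering is meant to exploit.
int read_value_or(int fallback) {
    Foo* f = gFoo.load(std::memory_order_consume);
    return f ? f->value : fallback;
}
```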

Initially I was thinking that we did need LFENCE to get a LoadStore barrier. ("Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions". This in turn prevents them from passing (becoming globally visible before) reads that are before the LFENCE).

Note that LFENCE + SFENCE is still weaker than a full MFENCE, because it's not a StoreLoad barrier. SFENCE's own documentation says it's ordered wrt. LFENCE, but that table of the x86 memory model from Intel manual vol3 doesn't mention that. If SFENCE can't execute until after an LFENCE, then sfence / lfence might actually be a slower equivalent to mfence, but lfence / sfence / movnti would give release semantics without a full barrier. Note that the NT store could become globally visible after some following loads/stores, unlike a normal strongly-ordered x86 store.

Related: NT loads

In x86, every load has acquire semantics, except for loads from WC memory. SSE4.1 MOVNTDQA is the only non-temporal load instruction, and it isn't weakly ordered when used on normal (WriteBack) memory. So it's an acquire-load, too (when used on WB memory).

Note that movntdq only has a store form, while movntdqa only has a load form. But apparently Intel couldn't just call them storentdqa and loadntdqa. They both have a 16B or 32B alignment requirement, so leaving off the a doesn't make a lot of sense to me. I guess SSE1 and SSE2 had already introduced some NT stores using the mov... mnemonic (like movntps), but no loads until years later in SSE4.1 (2nd-gen Core2: 45nm Penryn).

The docs say MOVNTDQA doesn't change the ordering semantics for the memory type it's used on.

... An implementation may also make use of the non-temporal hint associated with this instruction if the memory source is WB (write back) memory type.

A processor’s implementation of the non-temporal hint does not override the effective memory type semantics, but the implementation of the hint is processor dependent. For example, a processor implementation may choose to ignore the hint and process the instruction as a normal MOVDQA for any memory type.

In practice, current Intel mainstream CPUs (Haswell, Skylake) seem to ignore the hint for PREFETCHNTA and MOVNTDQA loads from WB memory. See Do current x86 architectures support non-temporal loads (from "normal" memory)?, and also Non-temporal loads and the hardware prefetcher, do they work together? for more details.

Also, if you are using it on WC memory (e.g. copying from video RAM, like in this Intel guide):

Because the WC protocol uses a weakly-ordered memory consistency model, an MFENCE or locked instruction should be used in conjunction with MOVNTDQA instructions if multiple processors might reference the same WC memory locations or in order to synchronize reads of a processor with writes by other agents in the system.

That doesn't spell out how it should be used, though. And I'm not sure why they say MFENCE rather than LFENCE for reading. Maybe they're talking about a write-to-device-memory, read-from-device-memory situation where stores have to be ordered with respect to loads (StoreLoad barrier), not just with each other (StoreStore barrier).

I searched in Vol3 for movntdqa, and didn't get any hits (in the whole pdf). 3 hits for movntdq: All the discussion of weak ordering and memory types only talks about stores. Note that LFENCE was introduced long before SSE4.1. Presumably it's useful for something, but IDK what. For load ordering, probably only with WC memory, but I haven't read up on when that would be useful.

LFENCE appears to be more than just a LoadLoad barrier for weakly-ordered loads: it orders other instructions too. (Not the global-visibility of stores, though, just their local execution).

From Intel's insn ref manual:

Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.
...
Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute until the LFENCE completes.

The entry for rdtsc suggests using LFENCE;RDTSC to prevent it from executing ahead of previous instructions when RDTSCP isn't available (and the weaker ordering guarantee is ok: rdtscp doesn't stop following instructions from executing ahead of it). (CPUID is a common suggestion for serializing the instruction stream around rdtsc.)
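A sketch of the LFENCE;RDTSC idiom, assuming GCC or Clang on x86-64 (MSVC would include `<intrin.h>` instead). `fenced_rdtsc`, `Timed` and `timed_sum` are hypothetical helpers, and the result is in TSC reference ticks, not seconds:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <x86intrin.h>  // __rdtsc, _mm_lfence (GCC/Clang)

// Read the time-stamp counter only after earlier instructions have
// completed locally.
inline std::uint64_t fenced_rdtsc() {
    _mm_lfence();
    return __rdtsc();
}

struct Timed {
    std::uint64_t sum;     // result of the measured work
    std::uint64_t ticks;   // elapsed TSC ticks (approximate)
};

// Sum an array and report roughly how many TSC ticks it took.
Timed timed_sum(const std::uint32_t* data, std::size_t n) {
    const std::uint64_t start = fenced_rdtsc();
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    const std::uint64_t stop = fenced_rdtsc();
    return Timed{sum, stop - start};
}
```

This only keeps rdtsc from running ahead of earlier instructions. Later instructions can still start before rdtsc executes unless a second lfence follows it.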
