9 votes

When executing a series of _mm_stream_load_si128() calls (MOVNTDQA) from consecutive memory locations, will the hardware prefetcher still kick in, or should I use explicit software prefetching (with the NTA hint) in order to obtain the benefits of prefetching while still avoiding cache pollution?

The reason I ask is that the two objectives seem contradictory to me. A streaming load fetches data bypassing the cache, while the prefetcher proactively fetches data into the cache.

When sequentially iterating a large data structure (the processed data won't be touched again for a long while), it would make sense to me to avoid polluting the cache hierarchy, but I do not want to incur frequent ~100 cycle penalties because the prefetcher is idle.

The target architecture is Intel Sandy Bridge.
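For concreteness, the loop looks roughly like this (a sketch; process() is a stand-in for the real per-vector work, and the 512-byte prefetch look-ahead is just a guess):

    #include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 */
    #include <stddef.h>

    void process(__m128i v);  /* hypothetical per-vector work */

    void consume(const __m128i *src, size_t n_vecs)
    {
        for (size_t i = 0; i < n_vecs; i++) {
            /* The question: is this SW prefetch needed, or will the HW
               prefetcher keep up on its own? */
            _mm_prefetch((const char *)&src[i] + 512, _MM_HINT_NTA);
            __m128i v = _mm_stream_load_si128((__m128i *)&src[i]);
            process(v);
        }
    }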

Good question. There's a prefetchnta, but I forget what I've read about this case. – Peter Cordes
According to some older Intel docs, non-temporal loads are the same as normal aligned loads unless the memory is uncacheable. My personal experience has confirmed that they make no performance difference on normal data. But this was back in the Nehalem/Sandy Bridge era. I have no idea if anything has changed for Haswell or Skylake. – Mysticial
@PeterCordes prefetchnta pulls into L1 cache only rather than all the caches. That said, I have no idea how it interacts with the hardware prefetcher. In cases where the memory access is "random enough" for the hardware prefetcher to fail, but "sequential enough" to use full cachelines (as is the case in a lot of cache-blocking optimizations), I've found that software prefetching makes a huge difference in the absence of Hyperthreading. (~10%) But I've seen no observable difference between prefetcht0 and prefetchnta. – Mysticial
@Mysticial: L3 is inclusive on recent Intel designs, so L3 tags can be used for cache coherency checks. A cache line present in L1 but not L3 could get stale if another core modified that cache line, but I think IA32's cache coherency model disallows this (so it can't be implemented this way). prefetchnta was introduced in PIII days, before multi-core CPUs. I wouldn't be at all surprised if it did exactly the same thing as prefetcht0 on current designs, like how lddqu is now identical to movdqu. Perhaps prefetchnta makes cache lines more likely to be evicted again quickly. – Peter Cordes
@PeterCordes Thanks for that insight on the caches. I've never thought about this from the perspective of cache coherency. – Mysticial

4 Answers

10 votes

According to Patrick Fay (Intel)'s Nov 2011 post, "On recent Intel processors, prefetchnta brings a line from memory into the L1 data cache (and not into the other cache levels)." He also says you need to make sure you don't prefetch too late (HW prefetch will already have pulled it into all levels) or too early (evicted by the time you get there).


As discussed in comments on the OP, current Intel CPUs have a large shared L3 which is inclusive of all the per-core caches. This means cache-coherency traffic only has to check L3 tags to see if a cache line might be modified somewhere in a per-core L1/L2.

IDK how to reconcile Pat Fay's explanation with my understanding of cache coherency / cache hierarchy. I thought if a line goes into L1, it would also have to go into L3. Maybe L1 tags have some kind of flag to say this line is weakly-ordered? My best guess is he was simplifying, and saying L1 when the data actually only goes into fill buffers.

This Intel guide about working with video RAM talks about non-temporal moves using load/store buffers rather than cache lines. (Note that this may only be the case for uncacheable memory.) It doesn't mention prefetch. It's also old, predating Sandy Bridge. However, it does have this juicy quote:

Ordinary load instructions pull data from USWC memory in units of the same size the instruction requests. By contrast, a streaming load instruction such as MOVNTDQA will commonly pull a full cache line of data to a special "fill buffer" in the CPU. Subsequent streaming loads would read from that fill buffer, incurring much less delay.

Another paragraph says typical CPUs have 8 to 10 fill buffers; SnB/Haswell still have 10 per core. Again, note that this may only apply to uncacheable memory regions.

movntdqa on WB (write-back) memory is not weakly-ordered (see the NT loads section of the linked answer), so it's not allowed to be "stale". Unlike NT stores, neither movntdqa nor prefetchnta change the memory ordering semantics of Write-Back memory.

I have not tested this guess, but prefetchnta / movntdqa on a modern Intel CPU could load a cache line into L3 and L1, but could skip L2 (because L2 isn't inclusive or exclusive of L1). The NT hint could have an effect by placing the cache line in the LRU position of its set, where it's the next line to be evicted. (Normal cache policy inserts new lines at the MRU position, farthest from being evicted. See this article about IvB's adaptive L3 policy for more about cache insertion policy).


Prefetch throughput on IvyBridge is only one per 43 cycles, so be careful not to prefetch too much if you don't want prefetches to slow down your code on IvB. Source: Agner Fog's insn tables and microarch guide. This is a performance bug specific to IvB. On other designs, too much prefetch will just take up uop throughput that could have gone to useful instructions (besides the harm of prefetching useless addresses).
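For example, issuing at most one prefetch per 64-byte cache line (rather than one per element) keeps the prefetch count low. A minimal sketch, where the 128-element (512-byte) look-ahead distance is an untuned guess:

    #include <xmmintrin.h>  /* _mm_prefetch */
    #include <stddef.h>

    float sum_array(const float *a, size_t n)
    {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) {
            /* At most one prefetch per 64-byte line (16 floats).  x86
               prefetches never fault, so overshooting the array is safe. */
            if ((i & 15) == 0)
                _mm_prefetch((const char *)&a[i + 128], _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }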

About SW prefetching in general (not the nt kind): Linus Torvalds posted about how they rarely help in the Linux kernel, and often do more harm than good. Apparently prefetching a NULL pointer at the end of a linked-list can cause a slowdown, because it attempts a TLB fill.
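The problematic pattern looks roughly like this (a sketch with a hypothetical node type):

    #include <xmmintrin.h>  /* _mm_prefetch */

    struct node { struct node *next; long payload; };

    long sum_list(const struct node *p)
    {
        long s = 0;
        while (p) {
            /* Looks free, but at the tail p->next is NULL, and
               prefetching NULL can still cost a TLB lookup / page walk. */
            _mm_prefetch((const char *)p->next, _MM_HINT_T0);
            s += p->payload;
            p = p->next;
        }
        return s;
    }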

6 votes

This question got me to do some reading... Looking at the Intel manual for MOVNTDQA (using a Sep'14 edition), there's an interesting statement -

A processor implementation may make use of the non-temporal hint associated with this instruction if the memory source is WC (write combining) memory type. An implementation may also make use of the non-temporal hint associated with this instruction if the memory source is WB (write back) memory type.

and later on -

The memory type of the region being read can override the non-temporal hint, if the memory address specified for the non-temporal read is not a WC memory region.

So there appears to be no guarantee that the non-temporal hint will do anything unless the memory type is WC. I don't really know what the WB comment means; maybe some Intel processors do allow you to use it to reduce cache pollution, or maybe they wanted to keep this option open for the future (so you don't start using MOVNTDQA on WB memory and assume it will always behave the same), but it's quite clear that WC memory is the real use case here. You want this instruction to provide some short-term buffering for data that would otherwise be completely uncacheable.
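For illustration, the intended WC usage looks something like this sketch (modeled on the pattern in Intel's video-RAM guide; wc_src is assumed to point into write-combining memory, e.g. a mapped framebuffer, and to be 64-byte aligned):

    #include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 */

    /* Read one 64-byte line from WC memory: the first MOVNTDQA pulls the
       whole line into a fill buffer, and the next three loads hit that
       buffer, so keep the four loads back-to-back. */
    void read_wc_line(const __m128i *wc_src, __m128i out[4])
    {
        out[0] = _mm_stream_load_si128((__m128i *)&wc_src[0]);
        out[1] = _mm_stream_load_si128((__m128i *)&wc_src[1]);
        out[2] = _mm_stream_load_si128((__m128i *)&wc_src[2]);
        out[3] = _mm_stream_load_si128((__m128i *)&wc_src[3]);
    }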

Now, on the other hand, looking at the description for prefetch*:

Prefetches from uncacheable or WC memory are ignored.

So that pretty much closes the story: your thinking is absolutely correct, these two are probably not meant to work together, and chances are one of them will be ignored.

OK, but is there a chance these two would actually work together (if the processor does implement NT loads for WB memory)? Well, reading the MOVNTDQA entry again, something else catches the eye:

Any memory-type aliased lines in the cache will be snooped and flushed.

Ouch. So if you somehow do manage to prefetch into your cache, you're actually likely to degrade the performance of any subsequent streaming load, since it would have to flush the line out first. Not a pretty thought.

6 votes

I recently ran some tests of the various prefetch flavors while answering another question, and my findings were:

The results from using prefetchnta were consistent with the following implementation on Skylake client:

  • prefetchnta loads values into the L1 and L3 but not the L2 (in fact, it seems the line may be evicted from the L2 if it is already there).
  • It seems to load the value "normally" into L1, but in a weaker way in L3 such that it is evicted more quickly (e.g., only into a single way in the set, or with its LRU flag set such that it will be the next victim).
  • prefetchnta, like all other prefetch instructions, uses an LFB entry, so it doesn't really help you get additional parallelism; but the NTA hint is useful here to avoid L2 and L3 pollution.

The current optimization manual (248966-038) claims in a few places that prefetchnta does bring data into the L2, but only in one way out of the set. E.g., in 7.6.2.1 Video Encoder:

The prefetching cache management implemented for the video encoder reduces the memory traffic. The second-level cache pollution reduction is ensured by preventing single-use video frame data from entering the second-level cache. Using a non-temporal PREFETCH (PREFETCHNTA) instruction brings data into only one way of the second-level cache, thus reducing pollution of the second-level cache.

This isn't consistent with my test results on Skylake, where striding over a 64 KiB region with prefetchnta shows performance almost exactly consistent with fetching data from the L3 (~4 cycles per load, with an MLP factor of 10 and an L3 latency of about 40 cycles):

                                 Cycles       ns
         64-KiB parallel loads     1.00     0.39
    64-KiB parallel prefetcht0     2.00     0.77
    64-KiB parallel prefetcht1     1.21     0.47
    64-KiB parallel prefetcht2     1.30     0.50
   64-KiB parallel prefetchnta     3.96     1.53

Since the L2 in Skylake is 4-way, a 64 KiB working set loaded into one way should just barely fit in the L2 cache (one way covers exactly 64 KiB), but the results above indicate that it doesn't stay there.
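The shape of such a test is roughly the following (a simplified sketch, not the actual uarch-bench code; it assumes buf is 64-byte aligned and that only the second loop is timed):

    #include <xmmintrin.h>  /* _mm_prefetch */
    #include <stddef.h>
    #include <stdint.h>

    /* Prefetch a 64-KiB region with prefetchnta, then re-read it.  If
       prefetchnta had installed the lines in one way of the 4-way L2
       (one way = 64 KiB), the re-read would see ~L2 latency; measuring
       ~L3 latency instead suggests the lines skipped the L2. */
    uint64_t reread_64k(const char *buf)
    {
        for (size_t off = 0; off < 64 * 1024; off += 64)
            _mm_prefetch(buf + off, _MM_HINT_NTA);

        uint64_t sum = 0;
        for (size_t off = 0; off < 64 * 1024; off += 64)   /* timed region */
            sum += *(const volatile uint64_t *)(buf + off);
        return sum;
    }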

You can run these tests on your own hardware on Linux using my uarch-bench program. Results for old systems would be particularly interesting.

Skylake Server (SKLX)

The reported behavior of prefetchnta on Skylake Server, which has a different L3 cache architecture, is significantly different from Skylake client. In particular, user Mysticial reports that lines fetched using prefetchnta are not available in any cache level and must be re-read from DRAM once they are evicted from L1.

The most likely explanation is that the lines never entered the L3 at all as a result of the prefetchnta. This is plausible because in Skylake Server the L3 is a non-inclusive shared victim cache for the private L2 caches, so lines that bypass the L2 cache via prefetchnta likely never get a chance to enter the L3. This makes prefetchnta both purer in function (fewer cache levels are polluted by prefetchnta requests) and more brittle (any failure to read an NTA line from L1 before it is evicted means another full round trip to memory, and the initial request triggered by the prefetchnta is totally wasted).

2 votes

Neither MOVNTDQA (on WC memory) nor PREFETCHNTA affects or triggers any of the cache hardware prefetchers. The whole idea of the non-temporal hint is to avoid cache pollution completely, or at least to minimize it as much as possible.

There is only a small (and undocumented) number of buffers, called streaming load buffers (these are separate from the line fill buffers and from the L1 cache), to hold cache lines fetched using MOVNTDQA. So basically you need to use what you fetch almost immediately. In addition, MOVNTDQA only works on WC memory.

The PREFETCHNTA instruction is perfect for your scenario, but you have to figure out how to use it properly in your code. From the Intel optimization manual Section 7.1:

If your algorithm is single-pass use PREFETCHNTA. If your algorithm is multi-pass use PREFETCHT0.

The PREFETCHNTA instruction offers the following benefits:

  • It fetches the cache line that contains the specified address into at least the L3 cache and potentially into higher levels of the cache hierarchy (see Bee's and Peter's answer and Section 7.3.2). In every cache level it lands in, it is likely to be marked as the first to be evicted when a line needs to be evicted from the set. In an implementation of a single-pass algorithm (such as computing the average of a large array of numbers; a sketch of this follows the list) that is enhanced with PREFETCHNTA, later prefetched cache lines can be placed in the same way of each set as lines that were also prefetched using PREFETCHNTA. So even if the total amount of data being fetched is massive, only one way of the whole cache is affected; the data residing in the other ways remains cached and is available after the algorithm terminates. But this is a double-edged sword: if two PREFETCHNTA instructions are too close to each other and their addresses map to the same cache set, only one will survive.
  • Cache lines prefetched using PREFETCHNTA are kept coherent like any other cached lines using the same hardware coherence mechanism.
  • It works on the WB, WC, and WT memory types. Most probably your data is stored in WB memory.
  • As I said before, it does not trigger hardware prefetching. For this reason, it can also be used to improve the performance of irregular memory access patterns, as recommended by Intel.
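To make the single-pass case from the first bullet concrete, here is a minimal sketch of computing the average of a large array with PREFETCHNTA; the one-prefetch-per-line cadence and the 256-element (2 KiB) look-ahead distance are illustrative guesses, not tuned values:

    #include <xmmintrin.h>  /* _mm_prefetch */
    #include <stddef.h>

    double average(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* One NTA prefetch per 64-byte line (8 doubles).  Prefetches
               never fault, so overshooting the array end is harmless. */
            if ((i & 7) == 0)
                _mm_prefetch((const char *)&a[i + 256], _MM_HINT_NTA);
            s += a[i];
        }
        return n ? s / (double)n : 0.0;
    }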

The thread that executes PREFETCHNTA may not be able to benefit from it effectively, depending on the behavior of any other threads running on the same physical core, on other physical cores of the same processor, or on cores of other processors that share the same coherence domain. Techniques such as pinning, priority boosting, CAT-based cache partitioning, and disabling hyperthreading may help that thread run efficiently. Note also that PREFETCHNTA is classified as a speculative load, and so it is concurrent with the three fence instructions.