Can we use non-temporal mov instructions on heap memory?

Question

In Agner Fog's "Optimizing subroutines in assembly language - section 11.8 Cache control instructions," he says: "Memory writes are more expensive than reads when cache misses occur in a write-back cache. A whole cache line has to be read from memory, modified, and written back in case of a cache miss. This can be avoided by using the non-temporal write instructions MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPD, MOVNTPS. These instructions should be used when writing to a memory location that is unlikely to be cached and unlikely to be read from again before the would-be cache line is evicted. As a rule of thumb, it can be recommended to use non-temporal writes only when writing a memory block that is bigger than half the size of the largest-level cache."

From the "Intel 64 and IA-32 Architectures Software Developer's Manual Combined Volumes Oct 2019" - "These SSE and SSE2 non-temporal store instructions minimize cache pollution by treating the memory being accessed as the write combining (WC) type. If a program specifies a non-temporal store with one of these instructions and the memory type of the destination region is write back (WB), write through (WT), or write combining (WC), the processor will do the following . . . "

I thought that write-combining memory is only found in graphics cards but not in general-purpose heap memory -- and by extension that the instructions listed above would only be useful in such cases. If that's true, why would Agner Fog recommend those instructions? The Intel manual seems to suggest that it's only useful with WB, WT or WC memory, but then they say that the memory being accessed will be treated as WC.

If those instructions actually can be used in an ordinary write to heap memory, are there any limitations? How do I allocate write-combining memory?

Peter Cordes · Accepted Answer · 2020-03-25T01:18:11

You can use NT stores like movntps on normal WB memory (i.e. the heap). See also Enhanced REP MOVSB for memcpy for more about NT stores vs. normal stores.

The CPU treats the destination as WC for the purposes of those NT stores, even though the MTRR and/or PAT has it set to normal WB.

The Intel docs are telling you that NT stores "work" on WB, WT, and WC memory. (But not strongly-ordered UC uncacheable memory, and of course not on WP write-protected memory).
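As a rough illustration of the point above, here is a minimal sketch in C using intrinsics that fills an ordinary heap buffer with NT stores. The helper name fill_nt is hypothetical, and the sketch assumes the destination is 16-byte aligned and the size is a multiple of 16; a real implementation would also handle unaligned heads and tails.

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_set1_epi8 */
#include <xmmintrin.h>  /* SSE: _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not from the original answer): fill `bytes` bytes at
   `dst` with `value` using non-temporal stores.
   Assumptions: dst is 16-byte aligned and bytes is a multiple of 16. */
static void fill_nt(void *dst, uint8_t value, size_t bytes)
{
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;

    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(p + i, v);  /* compiles to movntdq */

    /* NT stores are weakly ordered; sfence orders them before later stores,
       e.g. before setting a "buffer ready" flag for another thread. */
    _mm_sfence();
}
```

The buffer can come from the normal heap, for example aligned_alloc(64, n) with n a multiple of 64; nothing special is needed to make the memory WC. Whether this beats plain stores depends on the buffer size relative to the cache, as in the rule of thumb quoted in the question.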

You are correct that normally only video RAM (or possibly other similar device-memory regions) is mapped WC. And no, you can't easily allocate WC memory in a user-space process under a normal OS like Linux, but you wouldn't normally want to.

SSE4.1 NT loads (movntdqa) are only useful on WC memory (on other memory types current CPUs ignore the NT hint), but some cache pollution for loads is a small price to pay for HW prefetch and caching working. You can use NT prefetch (prefetchnta) from WB memory to reduce pollution in some levels of cache, e.g. bypassing L2. But that's hard to tune.
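For what the NT prefetch idea might look like in C, here is a small sketch that reads a large WB buffer once while issuing prefetchnta ahead of the current position. The function name sum_bytes_nta and the PREFETCH_AHEAD distance are assumptions for illustration only; the distance is untuned, and the right value (if any helps at all) depends on the CPU and the workload.

```c
#include <xmmintrin.h>  /* SSE: _mm_prefetch, _MM_HINT_NTA */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, untuned prefetch distance in bytes. */
#define PREFETCH_AHEAD 512

/* Hypothetical helper: sum all bytes of a buffer that is read only once,
   hinting with prefetchnta so the data pollutes fewer cache levels. */
static uint64_t sum_bytes_nta(const uint8_t *src, size_t bytes)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < bytes; i++) {
        /* One prefetch per 64 bytes of progress, staying inside the buffer. */
        if ((i % 64) == 0 && i + PREFETCH_AHEAD < bytes)
            _mm_prefetch((const char *)(src + i + PREFETCH_AHEAD), _MM_HINT_NTA);
        sum += src[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it; any benefit has to be measured.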

IIRC, normal stores like mov on WC memory have the store-merging behaviour you get from NT stores. But you don't need to use WC memory for NT stores to work.
