In section 11.8, "Cache control instructions," of Agner Fog's "Optimizing subroutines in assembly language," he says: "Memory writes are more expensive than reads when cache misses occur in a write-back cache. A whole cache line has to be read from memory, modified, and written back in case of a cache miss. This can be avoided by using the non-temporal write instructions MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPD, MOVNTPS. These instructions should be used when writing to a memory location that is unlikely to be cached and unlikely to be read from again before the would-be cache line is evicted. As a rule of thumb, it can be recommended to use non-temporal writes only when writing a memory block that is bigger than half the size of the largest-level cache."
From the "Intel 64 and IA-32 Architectures Software Developer's Manual Combined Volumes Oct 2019" - "These SSE and SSE2 non-temporal store instructions minimize cache pollution by treating the memory being accessed as the write combining (WC) type. If a program specifies a non-temporal store with one of these instructions and the memory type of the destination region is write back (WB), write through (WT), or write combining (WC), the processor will do the following . . . "
I thought that write-combining memory was found only in graphics cards, not in general-purpose heap memory, and by extension that the instructions listed above would be useful only in such cases. If that's true, why would Agner Fog recommend those instructions? The Intel manual seems to say that the instructions apply when the destination is WB, WT, or WC memory, yet it also says that the memory being accessed will be treated as WC.
If those instructions actually can be used in an ordinary write to heap memory, are there any limitations? How do I allocate write-combining memory?