prefetchNTA can't bypass caches, only reduce (not avoid) pollution. It can't break cache coherency or violate the memory-ordering semantics of a WB (Write-Back) memory region. (Unlike NT stores, which do fully bypass caches and are weakly-ordered even on normal WB memory.)
On paper, the x86 ISA doesn't specify how the NT hint is implemented.
http://felixcloutier.com/x86/PREFETCHh.html says: "NTA (non-temporal data with respect to all cache levels)—prefetch data into non-temporal cache structure and into a location close to the processor, minimizing cache pollution." How any specific CPU microarchitecture chooses to implement that is completely up to the architects.
prefetchNTA from WB memory (see footnote 1) on Intel CPUs populates L1d normally, allowing later loads to hit in L1d (as long as the prefetch distance is large enough that the prefetch completes, and small enough that the line isn't evicted again before the demand load). The correct prefetch distance depends on the system and other factors, and can be fairly brittle.
What it does do on Intel CPUs is skip non-inclusive outer caches. So on Intel before Skylake-AVX512, it bypasses L2 and populates L1d + L3. But on SKX it also skips L3 cache entirely, because L3 there is smaller and non-inclusive. See the related Q&A: Do current x86 architectures support non-temporal loads (from "normal" memory)?
On Intel CPUs with inclusive L3 caches (which it can't bypass), it reduces L3 pollution by being restricted to prefetching into one "way" of the associative inclusive L3 cache. (L3 is usually something like 16-way associative, so the total capacity that can be polluted by prefetchNTA is only ~1/16th of total L3 size.)
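To make the prefetch-distance point concrete, here's a minimal sketch of a read-once loop that issues prefetchNTA a fixed distance ahead of the demand loads. The function name `sum_once`, the `PREFETCH_NTA` wrapper macro, and the 512-byte distance are all made up for illustration; the distance in particular is a placeholder you'd have to tune per machine, not a recommendation. It also assumes 64-byte cache lines.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)0)   /* non-x86-64 build: hint compiled out */
#endif

/* Hypothetical tuning knob: how far ahead of the demand load to prefetch. */
#define PREFETCH_DISTANCE_BYTES 512

/* Sum an array that is read only once, hinting upcoming lines as non-temporal. */
uint64_t sum_once(const uint32_t *data, size_t n)
{
    const size_t per_line = 64 / sizeof(data[0]);   /* assumes 64-byte lines */
    const size_t ahead = PREFETCH_DISTANCE_BYTES / sizeof(data[0]);
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++) {
        /* One prefetch per cache line; skip it near the end of the array. */
        if (i % per_line == 0 && i + ahead < n)
            PREFETCH_NTA(&data[i + ahead]);
        total += data[i];
    }
    return total;
}
```

The prefetch is only a hint, so the result is the same with or without it; whether it helps, and at what distance, is something you'd have to measure on the target CPU.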
@HadiBrais commented on this answer with some info on AMD CPUs.
Instead of limiting pollution by fetching into only one way of the cache, apparently AMD allocates lines fetched with NT prefetch with a "quick eviction" marking. Probably this means allocating in the LRU position instead of the Most-Recently-Used position. So the next allocation in that set of the cache will evict the line.
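As a toy model of that guess (not AMD's documented behaviour), here's one set of an LRU cache where a miss either allocates at the MRU position as usual, or leaves the new line in the LRU slot so it's the next victim. The 4-way associativity, the tag values, and all names (`cache_set`, `set_access`, `hot_lines_surviving`) are invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define WAYS 4   /* toy associativity, not any real CPU's */

/* One cache set; tags[0] is the MRU position, tags[WAYS-1] the LRU position. */
typedef struct { int tags[WAYS]; } cache_set;

static void set_init(cache_set *s)
{
    for (size_t i = 0; i < WAYS; i++) s->tags[i] = -1;   /* -1 = empty way */
}

/* Access one line and return true on a hit. On a miss the LRU line is evicted;
   the new line moves to MRU normally, or stays in the LRU slot if quick_evict. */
static bool set_access(cache_set *s, int tag, bool quick_evict)
{
    size_t pos = WAYS - 1;            /* default victim: the LRU way */
    bool hit = false;
    for (size_t i = 0; i < WAYS; i++)
        if (s->tags[i] == tag) { pos = i; hit = true; break; }

    if (!hit && quick_evict) {
        s->tags[WAYS - 1] = tag;      /* replace the LRU line in place */
        return false;
    }
    /* Hit, or normal miss: slide younger lines down and put this tag at MRU. */
    for (size_t i = pos; i > 0; i--) s->tags[i] = s->tags[i - 1];
    s->tags[0] = tag;
    return hit;
}

/* Count how many of the hot tags 1..WAYS survive a stream of one-shot lines. */
static int hot_lines_surviving(bool quick_evict, int stream_len)
{
    cache_set s;
    set_init(&s);
    for (int t = 1; t <= WAYS; t++) set_access(&s, t, false);   /* warm up */
    for (int t = 0; t < stream_len; t++) set_access(&s, 100 + t, quick_evict);

    int alive = 0;
    for (size_t i = 0; i < WAYS; i++)
        if (s.tags[i] >= 1 && s.tags[i] <= WAYS) alive++;
    return alive;
}
```

In this model, a stream of 10 one-shot lines with normal allocation evicts all 4 hot lines, while the same stream with LRU-position allocation evicts only 1, because each streaming line just replaces the previous one.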
Footnote 1: prefetchNTA from WC memory I think prefetches into an LFB (Line Fill Buffer), allowing SSE4.1 movntdqa loads to hit an already-populated LFB. (movntdqa loads from WC memory do work by pulling data into an LFB, according to Intel. That's how multiple movntdqa loads on the same "cache line" can avoid multiple actual DRAM reads or PCIe transactions.) See also: Non-temporal loads and the hardware prefetcher, do they work together? (Answer: no, not with HW prefetch.)
But note that movntdqa from WB memory is not useful. It just works like an ordinary load (plus an ALU uop for some reason).