9
votes

I'm reading the paper What Every Programmer Should Know About Memory by Ulrich Drepper. At the beginning of part 6 there's a code fragment:

#include <emmintrin.h>
void setbytes(char *p, int c)
{
    __m128i i = _mm_set_epi8(c, c, c, c,
                             c, c, c, c,
                             c, c, c, c,
                             c, c, c, c);
    _mm_stream_si128((__m128i *)&p[0], i);
    _mm_stream_si128((__m128i *)&p[16], i);
    _mm_stream_si128((__m128i *)&p[32], i);
    _mm_stream_si128((__m128i *)&p[48], i);
}

With this comment right below it:

Assuming the pointer p is appropriately aligned, a call to this function will set all bytes of the addressed cache line to c. The write-combining logic will see the four generated movntdq instructions and only issue the write command for the memory once the last instruction has been executed. To summarize, this code sequence not only avoids reading the cache line before it is written, it also avoids polluting the cache with data which might not be needed soon.

What bugs me is that the comment on the function says it "will set all bytes of the addressed cache line to c", but from what I understand of streaming intrinsics they bypass the caches: there is neither cache reading nor cache writing. How would this code access any cache line? The second bolded fragment says something similar, that the function "avoids reading the cache line before it is written". As stated above, I don't see how and when the caches are written to. Also, does every write to memory need to be preceded by a cache read? Could someone clarify this issue for me?

3
Do you have a reference for your assumption about SSE cache operation? The Intel documentation talks about pollution, which is what Ulrich references in the comment. – Steve-o
My knowledge all comes from Ulrich's paper. Earlier in the chapter he writes: "These non-temporal write operations do not read a cache line and then modify it; instead, the new content is directly written to memory." It's from the second paragraph of section 6.1, 'Bypassing the Cache'. – Pawel Batko
It isn't clear to me what he's trying to say, but MOVNTDQ does update the cache if it happens to contain the address. – Hans Passant
@HansPassant: movntdq can hit in cache, but it evicts the line from cache if it was present, according to Intel manual vol1 ch 10.4.6.2 Caching of Temporal vs. Non-Temporal Data. I guess this design-decision was made so drivers can avoid a clflush after an NT store to video memory or something. (IIRC, the doc says that this guaranteed eviction didn't happen on the earlier CPUs to support the instruction.) – Peter Cordes
_mm_set1_epi8(c) would be a much easier way of broadcasting a byte than typing c 16 times. – Peter Cordes
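For illustration, the broadcast suggested in the last comment would look like this (a sketch only, not from the paper; the behavior should match the original fragment, assuming `p` is 64-byte aligned):

```c
#include <emmintrin.h>

/* Same effect as the fragment from the paper, but broadcasting the
 * byte with _mm_set1_epi8 instead of spelling out c sixteen times.
 * Assumes p points to a 64-byte-aligned block (one full cache line
 * on most current x86 CPUs). */
void setbytes(char *p, int c)
{
    __m128i i = _mm_set1_epi8((char)c);
    _mm_stream_si128((__m128i *)&p[0],  i);
    _mm_stream_si128((__m128i *)&p[16], i);
    _mm_stream_si128((__m128i *)&p[32], i);
    _mm_stream_si128((__m128i *)&p[48], i);
}
```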

3 Answers

3
votes

When you write to memory, the cache line containing the written address must normally first be loaded into the caches, because you might write only part of the line.

When you write to memory, stores are first grouped in store buffers. Typically, once a buffer is full it is flushed to the caches/memory. Note that the number of store buffers is typically small (~4). Consecutive writes to adjacent addresses will use the same store buffer.

The streaming read/write with non-temporal hints are typically used to reduce cache pollution (often with WC memory). The idea is that a small set of cache lines are reserved on the CPU for these instructions to use. Instead of loading a cache line into the main caches, it is loaded into this smaller cache.

The comment supposes the following behavior (I cannot find any reference that the hardware actually does this; one would need to measure it or find a solid source, and it could vary from hardware to hardware): once the CPU sees that the store buffer is full and aligned to a cache line, it flushes the buffer directly to memory, since the non-temporal write bypasses the main cache.

The only way this would work is if the merging of the store buffer with the actual cache line written happens once it is flushed. This is a fair assumption.

Note that if the cache line written is already in the main caches, the above method will also update them.

If regular memory writes were used instead of non-temporal writes, the store buffer flush would also update the main caches. It is entirely possible that this scenario would also avoid reading the original cache line from memory.

If a partial cache line is written with a non-temporal write, presumably the cache line will need to be fetched from main memory (or the main cache if present) and could be terribly slow if we have not read the cache line ahead of time with a regular read or non-temporal read (which would place it into our separate cache).

Typically the non-temporal cache size is on the order of 4-8 cache lines.

To summarize, the last instruction triggers the write because it also happens to fill up the store buffer. The store buffer flush can avoid reading the cache line written to because the hardware knows the store buffer is contiguous and aligned to a cache line. The non-temporal write hint only serves to avoid populating the main cache with the written cache line, if and only if it wasn't already in the main caches.
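As a hedged usage sketch of the above (not from the answer; `make_filled_line` is a hypothetical helper), note that the alignment assumption matters, and that non-temporal stores are weakly ordered, so an `sfence` is needed before the data is published to another thread:

```c
#include <emmintrin.h>
#include <stdlib.h>

/* The function from the question, unchanged. */
static void setbytes(char *p, int c)
{
    __m128i i = _mm_set_epi8(c, c, c, c, c, c, c, c,
                             c, c, c, c, c, c, c, c);
    _mm_stream_si128((__m128i *)&p[0],  i);
    _mm_stream_si128((__m128i *)&p[16], i);
    _mm_stream_si128((__m128i *)&p[32], i);
    _mm_stream_si128((__m128i *)&p[48], i);
}

/* Hypothetical helper: allocate one 64-byte-aligned "cache line",
 * fill it with c, then fence. Without the sfence, the weakly ordered
 * NT stores are not guaranteed to be visible to another thread yet. */
char *make_filled_line(int c)
{
    char *line = aligned_alloc(64, 64);  /* C11; alignment is essential */
    if (!line)
        return NULL;
    setbytes(line, c);
    _mm_sfence();
    return line;
}
```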

1
votes

I think this is partly a terminology question: The passage you quote from Ulrich Drepper's article isn't talking about cached data. It's just using the term "cache line" for an aligned 64B block.

This is normal, and especially useful when talking about a range of hardware with different cache-line sizes. (Earlier x86 CPUs, as recently as PIII, had 32B cache lines, so using this terminology avoids hard-coding that microarch design decision into the discussion.)
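Incidentally, the line size need not be hard-coded at all; on Linux with glibc it can be queried at run time (a small sketch; `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension, not POSIX):

```c
#include <unistd.h>

/* Query the L1 data-cache line size at run time instead of
 * hard-coding 32 or 64 bytes. glibc may return 0 when the size
 * is unknown (e.g. in some container environments). */
long l1_line_size(void)
{
    return sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
}
```

On current x86-64 machines this typically reports 64.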

A cache-line of data is still a cache-line even if it's not currently hot in any caches.

-2
votes

I don't have references under my fingers to prove what I am saying, but my understanding is this: the only unit of transfer over the memory bus is the cache line, whether it goes into the cache or into some special registers. So indeed, the code you pasted fills a cache line, but it is a special cache line that does not reside in the cache. Once all bytes of this cache line have been modified, the cache line is sent directly to memory, without passing through the cache.