9
votes

I'm reading the paper What Every Programmer Should Know About Memory by Ulrich Drepper. At the beginning of part 6 there's a code fragment:

#include <emmintrin.h>
void setbytes(char *p, int c)
{
    __m128i i = _mm_set_epi8(c, c, c, c,
                             c, c, c, c,
                             c, c, c, c,
                             c, c, c, c);
    _mm_stream_si128((__m128i *)&p[0], i);
    _mm_stream_si128((__m128i *)&p[16], i);
    _mm_stream_si128((__m128i *)&p[32], i);
    _mm_stream_si128((__m128i *)&p[48], i);
}

With this comment right below it:

Assuming the pointer p is appropriately aligned, a call to this function will set all bytes of the addressed cache line to c. The write-combining logic will see the four generated movntdq instructions and only issue the write command for the memory once the last instruction has been executed. To summarize, this code sequence not only avoids reading the cache line before it is written, it also avoids polluting the cache with data which might not be needed soon.

What bugs me is that the comment on the function says it "will set all bytes of the addressed cache line to c", but from what I understand of streaming intrinsics they bypass the caches: there is neither cache reading nor cache writing. How would this code access any cache line? The second bolded fragment says something similar, that the function "avoids reading the cache line before it is written". As stated above, I don't see how and when the caches are written to. Also, does every write to memory need to be preceded by a cache read? Could someone clarify this issue for me?

3
Do you have a reference for your assumption about SSE cache operation? The Intel documentation talks about pollution, which is what Ulrich references in the comment. – Steve-o
My knowledge all comes from Ulrich's paper. Earlier in the chapter he writes: "These non-temporal write operations do not read a cache line and then modify it; instead, the new content is directly written to memory." It's from the second paragraph of section 6.1, 'Bypassing the Cache'. – Pawel Batko
It isn't clear to me what he's trying to say, but MOVNTDQ does update the cache if it happens to contain the address. – Hans Passant
@HansPassant: movntdq can hit in cache, but it evicts the line from cache if it was present, according to Intel manual vol1 ch 10.4.6.2 Caching of Temporal vs. Non-Temporal Data. I guess this design-decision was made so drivers can avoid a clflush after an NT store to video memory or something. (IIRC, the doc says that this guaranteed eviction didn't happen on the earlier CPUs to support the instruction.) – Peter Cordes
_mm_set1_epi8(c) would be a much easier way of broadcasting a byte than typing c 16 times. – Peter Cordes
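For illustration, the broadcast suggested in the last comment would look like this (a sketch only, not from the paper; the behavior should match the original fragment, assuming `p` is 64-byte aligned):

```c
#include <emmintrin.h>

/* Same effect as the fragment from the paper, but broadcasting the
 * byte with _mm_set1_epi8 instead of spelling out c sixteen times.
 * Assumes p points to a 64-byte-aligned block (one full cache line
 * on most current x86 CPUs). */
void setbytes(char *p, int c)
{
    __m128i i = _mm_set1_epi8((char)c);
    _mm_stream_si128((__m128i *)&p[0],  i);
    _mm_stream_si128((__m128i *)&p[16], i);
    _mm_stream_si128((__m128i *)&p[32], i);
    _mm_stream_si128((__m128i *)&p[48], i);
}
```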

3 Answers

3
votes

When you write to memory, the cache line containing the written address must normally first be loaded into the caches, because you might write only part of the line.

When you write to memory, stores are first grouped in store buffers. Typically, once a buffer is full it is flushed to the caches/memory. Note that the number of store buffers is typically small (~4). Consecutive writes to adjacent addresses will use the same store buffer.

The streaming read/write with non-temporal hints are typically used to reduce cache pollution (often with WC memory). The idea is that a small set of cache lines are reserved on the CPU for these instructions to use. Instead of loading a cache line into the main caches, it is loaded into this smaller cache.

The comment supposes the following behavior (I cannot find any reference that the hardware actually does this; one would need to measure it or find a solid source, and it could vary from hardware to hardware): once the CPU sees that the store buffer is full and aligned to a cache line, it flushes the buffer directly to memory, since the non-temporal write bypasses the main cache.

The only way this would work is if the merging of the store buffer with the actual cache line written happens once it is flushed. This is a fair assumption.

Note that if the cache line written is already in the main caches, the above method will also update them.

If regular memory writes were used instead of non-temporal writes, the store buffer flush would also update the main caches. It is entirely possible that this scenario would also avoid reading the original cache line from memory.

If a partial cache line is written with a non-temporal write, presumably the cache line will need to be fetched from main memory (or the main cache if present) and could be terribly slow if we have not read the cache line ahead of time with a regular read or non-temporal read (which would place it into our separate cache).

Typically the non-temporal cache size is on the order of 4-8 cache lines.

To summarize, the last instruction triggers the write because it also happens to fill up the store buffer. The store buffer flush can avoid reading the cache line written to because the hardware knows the store buffer is contiguous and aligned to a cache line. The non-temporal write hint only serves to avoid populating the main cache with the written cache line, if and only if it wasn't already in the main caches.
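As a hedged usage sketch of the above (not from the answer; `make_filled_line` is a hypothetical helper), note that the alignment assumption matters, and that non-temporal stores are weakly ordered, so an `sfence` is needed before the data is published to another thread:

```c
#include <emmintrin.h>
#include <stdlib.h>

/* The function from the question, unchanged. */
static void setbytes(char *p, int c)
{
    __m128i i = _mm_set_epi8(c, c, c, c, c, c, c, c,
                             c, c, c, c, c, c, c, c);
    _mm_stream_si128((__m128i *)&p[0],  i);
    _mm_stream_si128((__m128i *)&p[16], i);
    _mm_stream_si128((__m128i *)&p[32], i);
    _mm_stream_si128((__m128i *)&p[48], i);
}

/* Hypothetical helper: allocate one 64-byte-aligned "cache line",
 * fill it with c, then fence. Without the sfence, the weakly ordered
 * NT stores are not guaranteed to be visible to another thread yet. */
char *make_filled_line(int c)
{
    char *line = aligned_alloc(64, 64);  /* C11; alignment is essential */
    if (!line)
        return NULL;
    setbytes(line, c);
    _mm_sfence();
    return line;
}
```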

1
votes

I think this is partly a terminology question: The passage you quote from Ulrich Drepper's article isn't talking about cached data. It's just using the term "cache line" for an aligned 64B block.

This is normal, and especially useful when talking about a range of hardware with different cache-line sizes. (Earlier x86 CPUs, as recently as PIII, had 32B cache lines, so using this terminology avoids hard-coding that microarch design decision into the discussion.)
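Incidentally, the line size need not be hard-coded at all; on Linux with glibc it can be queried at run time (a small sketch; `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension, not POSIX):

```c
#include <unistd.h>

/* Query the L1 data-cache line size at run time instead of
 * hard-coding 32 or 64 bytes. glibc may return 0 when the size
 * is unknown (e.g. in some container environments). */
long l1_line_size(void)
{
    return sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
}
```

On current x86-64 machines this typically reports 64.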

A cache-line of data is still a cache-line even if it's not currently hot in any caches.

-2
votes

I don't have references under my fingers to prove what I am saying, but my understanding is this: the only unit of transfer over the memory bus is the cache line, whether it goes into the cache or into some special registers. So indeed, the code you pasted fills a cache line, but it is a special cache line that does not reside in the cache. Once all bytes of this cache line have been modified, the cache line is sent directly to memory, without passing through the cache.