I'm reading the PDF of What Every Programmer Should Know About Memory by Ulrich Drepper. At the beginning of part 6 there's a code fragment:
#include <emmintrin.h>
void setbytes(char *p, int c)
{
  __m128i i = _mm_set_epi8(c, c, c, c,
                           c, c, c, c,
                           c, c, c, c,
                           c, c, c, c);
  _mm_stream_si128((__m128i *)&p[0], i);
  _mm_stream_si128((__m128i *)&p[16], i);
  _mm_stream_si128((__m128i *)&p[32], i);
  _mm_stream_si128((__m128i *)&p[48], i);
}
With the following comment right below it:
Assuming the pointer p is appropriately aligned, a call to this function will set all bytes of the addressed cache line to c. The write-combining logic will see the four generated movntdq instructions and only issue the write command for the memory once the last instruction has been executed. To summarize, this code sequence not only avoids reading the cache line before it is written, it also avoids polluting the cache with data which might not be needed soon.
What bugs me is that the comment says the function "will set all bytes of the addressed cache line to c", but from what I understand of the stream intrinsics, they bypass the caches: there is neither cache reading nor cache writing. How would this code access any cache line? The second bolded fragment says something similar, that the function "avoids reading the cache line before it is written". As stated above, I don't see how or when the caches are written to. Also, does any write to cache need to be preceded by a cache read? Could someone clarify this for me?
movntdq can hit in cache, but it evicts the line from cache if it was present, according to Intel manual vol1 ch 10.4.6.2 Caching of Temporal vs. Non-Temporal Data. I guess this design decision was made so drivers can avoid a clflush after an NT store to video memory or something. (IIRC, the doc says that this guaranteed eviction didn't happen on the earlier CPUs to support the instruction.) – Peter Cordes
_mm_set1_epi8(c) would be a much easier way of broadcasting a byte than typing c 16 times. – Peter Cordes
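To illustrate the second comment, here is a minimal sketch of the same function using _mm_set1_epi8 for the broadcast. It is not Drepper's code: the names setbytes_nt and count_matching are hypothetical, the 64-byte cache line size is an assumption (four 16-byte stores per line, as in the original fragment), and the alignment assert and trailing _mm_sfence are additions that the original does not contain.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#include <stdint.h>

/* Fill one cache line at p with the byte c using non-temporal stores.
   Assumption: cache lines are 64 bytes and p is 64-byte aligned. */
static void setbytes_nt(char *p, int c)
{
    /* movntdq requires 16-byte alignment; checking for 64 also ensures
       the four stores cover exactly one (assumed) cache line. */
    assert(((uintptr_t)p & 63) == 0);

    /* Broadcast the low byte of c into all 16 lanes. */
    __m128i v = _mm_set1_epi8((char)c);

    _mm_stream_si128((__m128i *)&p[0], v);
    _mm_stream_si128((__m128i *)&p[16], v);
    _mm_stream_si128((__m128i *)&p[32], v);
    _mm_stream_si128((__m128i *)&p[48], v);

    /* Added here, not in the original: order the weakly-ordered NT stores
       before any later stores, e.g. a flag that publishes the buffer. */
    _mm_sfence();
}

/* Hypothetical helper: count how many of the 64 bytes at p equal c. */
static int count_matching(const char *p, int c)
{
    int n = 0;
    for (int i = 0; i < 64; i++)
        n += (p[i] == (char)c);
    return n;
}
```

Calling setbytes_nt(line, 'x') on a buffer declared as _Alignas(64) char line[64] and then reading it back with count_matching(line, 'x') should give 64: the stores still update memory, they just do not allocate the line in the cache on the way.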