I'm trying to understand how the hardware cache works by writing and running a test program:
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE 64
#define L1_WAYS 8
#define L1_SETS 64
#define L1_LINES 512

// 32K memory for filling in L1 cache
uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    register uint64_t i;
    unsigned int junk = 0;  // __rdtscp takes an unsigned int *
    register uint64_t t1, t2;

    printf("data: %p\n", (void *)data);
    //_mm_clflush(data);
    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        // cast so the arguments match the format specifiers
        printf("i = %2d, cycles: %llu\n", (int)i, (unsigned long long)t2);
    }
}
I ran the code with and without the _mm_clflush call, and the results show that with _mm_clflush the first memory access is actually faster.
With _mm_clflush:
$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i = 0, cycles: 280
i = 1, cycles: 84
i = 2, cycles: 91
i = 3, cycles: 77
i = 4, cycles: 91
Without _mm_clflush:
$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i = 0, cycles: 3899
i = 1, cycles: 91
i = 2, cycles: 105
i = 3, cycles: 77
i = 4, cycles: 84
This does not make sense to me: flushing the cache line should make the first access slower, yet it actually gets faster. Could anyone explain why this happens? Thanks
----------------Further experiment-------------------
Let's assume the 3899 cycles are caused by a TLB miss. To test my understanding of cache hits and misses, I slightly modified the code to compare the memory access time for an L1 cache hit and an L1 cache miss.
This time, the second loop strides by the cache line size (64 bytes), so each access touches a different cache line.
*data = 1;
_mm_clflush(data);
printf("accessing 16 bytes in a cache line:\n");
for (i = 0; i < 16; i++) {
    t1 = __rdtscp(&junk);
    addr = &data[i];
    junk = *addr;
    t2 = __rdtscp(&junk) - t1;
    printf("i = %2d, cycles: %llu\n", (int)i, (unsigned long long)t2);
}

// Invalidate and flush the cache line that contains data[0] from all levels of the cache hierarchy.
_mm_clflush(data);
printf("accessing 16 bytes in different cache lines:\n");
for (i = 0; i < 16; i++) {
    t1 = __rdtscp(&junk);
    addr = &data[i*LINE_SIZE];
    junk = *addr;
    t2 = __rdtscp(&junk) - t1;
    printf("i = %2d, cycles: %llu\n", (int)i, (unsigned long long)t2);
}
My computer has an 8-way set-associative L1 data cache with 64 sets, 32KB in total. If I access memory every 64 bytes, I expected every access to be a cache miss. But it seems that many of the cache lines are already cached:
$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i = 0, cycles: 273
i = 1, cycles: 70
i = 2, cycles: 70
i = 3, cycles: 70
i = 4, cycles: 70
i = 5, cycles: 70
i = 6, cycles: 70
i = 7, cycles: 70
i = 8, cycles: 70
i = 9, cycles: 70
i = 10, cycles: 77
i = 11, cycles: 70
i = 12, cycles: 70
i = 13, cycles: 70
i = 14, cycles: 70
i = 15, cycles: 140
accessing 16 bytes in different cache lines:
i = 0, cycles: 301
i = 1, cycles: 133
i = 2, cycles: 70
i = 3, cycles: 70
i = 4, cycles: 147
i = 5, cycles: 56
i = 6, cycles: 70
i = 7, cycles: 63
i = 8, cycles: 70
i = 9, cycles: 63
i = 10, cycles: 70
i = 11, cycles: 112
i = 12, cycles: 147
i = 13, cycles: 119
i = 14, cycles: 56
i = 15, cycles: 105
Is this caused by prefetching? Or is there something wrong with my understanding? Thanks
clflush does flush the cache line if it's present in any cache. See "clflush to invalidate cache line via C function" for a working program that measures cache hit vs. L3 miss latency. – Peter Cordes