6
votes

I'm trying to understand how the hardware cache works by writing and running a test program:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE   64

#define L1_WAYS     8
#define L1_SETS     64
#define L1_LINES    512

// 32K memory for filling in L1 cache
uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    register uint64_t i;
    unsigned int junk = 0;   // __rdtscp() writes the TSC_AUX value through this pointer
    register uint64_t t1, t2;

    printf("data: %p\n", data);

    //_mm_clflush(data);
    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2d, cycles: %ld\n", i, t2);
    }
}

I ran the code with and without the _mm_clflush, and the results show that with _mm_clflush the first memory access is actually faster.

with _mm_clflush:

$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i =  0, cycles: 280
i =  1, cycles: 84
i =  2, cycles: 91
i =  3, cycles: 77
i =  4, cycles: 91

w/o _mm_clflush:

$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i =  0, cycles: 3899
i =  1, cycles: 91
i =  2, cycles: 105
i =  3, cycles: 77
i =  4, cycles: 84

It just does not make sense: I flush the cache line, yet the first access gets faster? Could anyone explain why this happens? Thanks

----------------Further experiment-------------------

Let's assume the 3899 cycles are caused by a TLB miss. To verify my understanding of cache hits and misses, I slightly modified this code to compare the memory access time for an L1 cache hit with that for an L1 cache miss.

This time, the code also steps through memory at the cache line size (64 bytes), so each access in the second loop lands in a different cache line.

*data = 1;
_mm_clflush(data);
printf("accessing 16 bytes in a cache line:\n");
for (i = 0; i < 16; i++) {
    t1 = __rdtscp(&junk);
    addr = &data[i];
    junk = *addr;
    t2 = __rdtscp(&junk) - t1;
    printf("i = %2d, cycles: %ld\n", i, t2);
}

// Invalidate and flush the cache line that contains data[0] from all levels of the cache hierarchy.
_mm_clflush(data);
printf("accessing 16 bytes in different cache lines:\n");
for (i = 0; i < 16; i++) {
    t1 = __rdtscp(&junk);
    addr = &data[i*LINE_SIZE];
    junk = *addr;
    t2 = __rdtscp(&junk) - t1;
    printf("i = %2d, cycles: %ld\n", i, t2);
}

My computer has an 8-way set-associative L1 data cache with 64 sets, 32 KB in total (8 ways × 64 sets × 64 B per line = 32 KB). If I access memory every 64 bytes, each access lands in a line that has never been loaded, so every access should be a cache miss. But it seems that a lot of these cache lines are already cached:

$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i =  0, cycles: 273
i =  1, cycles: 70
i =  2, cycles: 70
i =  3, cycles: 70
i =  4, cycles: 70
i =  5, cycles: 70
i =  6, cycles: 70
i =  7, cycles: 70
i =  8, cycles: 70
i =  9, cycles: 70
i = 10, cycles: 77
i = 11, cycles: 70
i = 12, cycles: 70
i = 13, cycles: 70
i = 14, cycles: 70
i = 15, cycles: 140
accessing 16 bytes in different cache lines:
i =  0, cycles: 301
i =  1, cycles: 133
i =  2, cycles: 70
i =  3, cycles: 70
i =  4, cycles: 147
i =  5, cycles: 56
i =  6, cycles: 70
i =  7, cycles: 63
i =  8, cycles: 70
i =  9, cycles: 63
i = 10, cycles: 70
i = 11, cycles: 112
i = 12, cycles: 147
i = 13, cycles: 119
i = 14, cycles: 56
i = 15, cycles: 105

Is this caused by the hardware prefetcher? Or is there something wrong with my understanding? Thanks

You can't measure with a granularity of a few cycles using rdtsc (even the rdtscp flavor); you're probably perturbing the measurement more than the load itself costs. You need to amortize over multiple lines. – Leeor
Yes, clflush does flush the cache line if it's present in any cache. See "clflush to invalidate cache line via C function" for a working program that measures cache-hit vs. L3-miss latency. – Peter Cordes
@Leeor Do you mean the measurement is inaccurate because of the cycles consumed by the rdtscp call itself? Actually, I am studying cache side channels; I guess the result could still be distinct enough to tell an L1 hit, an L2 hit, and a cache miss apart. – xiaogw
Thanks, @PeterCordes. I updated my question to look at the difference between a cache-line hit and a cache-line miss. The result is just a little difficult to explain. Could you answer that? – xiaogw
For a cache-read side channel, see "How can I create a spectre gadget in practice?". If I get around to it later, I'll read your whole question in more detail; I don't have time right now. – Peter Cordes

3 Answers

2
votes

I modified the code by adding a write before _mm_clflush(data), and it shows that clflush does flush the cache line. The modified code:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE   64
#define L1_LINES    512

// 32K memory for filling in L1 cache
uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    register uint64_t i;
    unsigned int junk = 0;
    register uint64_t t1, t2;

    data[0] = 1; // write before clflush
    //_mm_clflush(data);

    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2d, cycles: %ld\n", i, t2);
    }
}

I ran the modified code on my computer (Intel(R) Core(TM) i5-8500 CPU) and got the results below. Across several runs, the first access to data has a noticeably higher latency when the line has been flushed than when it has not.

without clflush:

data: 0000000000407980
accessing 16 bytes in a cache line:
i =  0, cycles: 64
i =  1, cycles: 46
i =  2, cycles: 49
i =  3, cycles: 48
i =  4, cycles: 46

with clflush:

data: 0000000000407980
accessing 16 bytes in a cache line:
i =  0, cycles: 214
i =  1, cycles: 41
i =  2, cycles: 40
i =  3, cycles: 42
i =  4, cycles: 40

1
votes

I guess it might be caused by a TLB miss at the beginning? And _mm_clflush actually brings the translation for this virtual address into the TLB; am I possibly right? How can I prove it?
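One way to test the TLB hypothesis (a minimal sketch of my own, not from the question, assuming data[0] and data[LINE_SIZE] fall in the same 4 KiB page): warm the translation by touching a different cache line in the same page, flush both lines, and then time the access to data[0]. If the timed access now costs a few hundred cycles instead of thousands, the huge first-access number was page-fault/translation overhead rather than an ordinary cache miss.

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE   64
#define L1_LINES    512

uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    unsigned int junk = 0;
    uint64_t t1, t2;

    // Touch a *different* line in (presumably) the same 4 KiB page:
    // this faults the page in and loads the TLB entry.
    addr = &data[LINE_SIZE];
    junk = *addr;

    // Make sure both lines are out of the cache before timing.
    _mm_clflush(&data[LINE_SIZE]);
    _mm_clflush(&data[0]);

    t1 = __rdtscp(&junk);
    addr = &data[0];
    junk = *addr;
    t2 = __rdtscp(&junk) - t1;

    // Warm TLB, cold line: expect an ordinary cache-miss latency here.
    printf("first access, warm TLB, cold line: %lu cycles\n", t2);
}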

1
votes

Without clflush, the first load takes about 3899 cycles, which is roughly the time it takes to handle a minor page fault. rdtscp serializes the load operations, thereby ensuring that all later loads to the same line hit in the L1 cache. Now when you add clflush just before the loop, the page fault is triggered and handled outside the loop. When the page fault handler returns and clflush gets re-executed, the target cache line gets flushed. On Intel processors, rdtscp ensures that the line is flushed before the first load in the loop is issued. Therefore, the first load misses in the cache hierarchy and its latency is about that of a memory access. Just like the previous case, later loads are serialized by rdtscp, so they all hit in the L1D.
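If you want to confirm that it really is a minor page fault, one option (a sketch assuming Linux, where getrusage exposes the minor-fault count in ru_minflt) is to read the process's fault counter immediately before and after the first touch of the array:

#include <stdio.h>
#include <stdint.h>
#include <sys/resource.h>

#define LINE_SIZE   64
#define L1_LINES    512

uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr = &data[0];
    volatile uint8_t junk;
    struct rusage before, after;

    getrusage(RUSAGE_SELF, &before);
    junk = *addr;                     // first touch of the BSS page
    getrusage(RUSAGE_SELF, &after);

    // A nonzero difference means the expensive first access coincided
    // with a minor page fault, not just a cache miss.
    printf("minor faults around first access: %ld\n",
           after.ru_minflt - before.ru_minflt);
}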

The measured L1D hit latencies are too high, though, even if we consider the overhead of rdtscp. Did you compile with -O3?

I wasn't able to reproduce your results (i.e., the minor page fault) with gcc 5.5.0 on Linux 4.4.0-154 when the cache line is allocated statically, but only when I use mmap. If you tell me your compiler version and kernel version, maybe I can investigate this further.
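For reference, this is roughly the mmap-based allocation I mean (a sketch only; the rdtscp timing loops stay exactly as in the question). An anonymous private mapping is populated lazily, so the first access to each page, or the clflush that targets it, takes a minor page fault unless you pass MAP_POPULATE:

#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

#define LINE_SIZE   64
#define L1_LINES    512

int main()
{
    // Pages are allocated lazily: the first touch of each 4 KiB page
    // (or a clflush aimed at it) triggers a minor page fault.
    uint8_t *data = mmap(NULL, L1_LINES*LINE_SIZE,
                         PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    printf("data: %p\n", data);
    // ... same rdtscp timing loops as in the question ...

    munmap(data, L1_LINES*LINE_SIZE);
}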

Regarding your second question, the way you're measuring load latency will not enable you to distinguish between an L1D hit and an L2 hit because the error in measurement could be as large as the difference in latencies. You can use the MEM_LOAD_UOPS_RETIRED.L1_HIT and MEM_LOAD_UOPS_RETIRED.L2_HIT performance counters to check. A sequential access pattern is pretty easily detectable by the L1 and L2 hardware prefetchers, so it's not surprising to get hits if you don't switch the prefetchers off.
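To illustrate the amortization idea from the comments (a rough sketch, not a calibrated benchmark): time one pass over many flushed lines with a single rdtscp pair and divide by the number of lines, so the fixed rdtscp overhead stops dominating each sample. Because the pass below is sequential, the hardware prefetchers will still hide much of the miss latency, which is exactly the effect described above; a randomized or pointer-chasing visit order, or the performance counters, would be needed to see the true per-line miss cost.

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE   64
#define L1_LINES    512

uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    unsigned int junk = 0;
    uint64_t t1, t2, i;

    // Flushing every line both faults the pages in and guarantees that
    // no line is cached when the timed pass starts.
    for (i = 0; i < L1_LINES; i++)
        _mm_clflush(&data[i*LINE_SIZE]);

    t1 = __rdtscp(&junk);
    for (i = 0; i < L1_LINES; i++) {
        addr = &data[i*LINE_SIZE];
        junk += *addr;
    }
    t2 = __rdtscp(&junk) - t1;

    printf("average cycles per line over %d lines: %lu\n",
           L1_LINES, t2 / L1_LINES);
}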