6
votes

I'm trying to understand how the hardware cache works by writing and running a test program:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE   64

#define L1_WAYS     8
#define L1_SETS     64
#define L1_LINES    512

// 32K memory for filling in L1 cache
uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    register uint64_t i;
    unsigned int junk = 0;   // __rdtscp() writes the TSC_AUX value through this pointer
    register uint64_t t1, t2;

    printf("data: %p\n", data);

    //_mm_clflush(data);
    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2d, cycles: %ld\n", i, t2);
    }
}

I ran the code with and without the _mm_clflush, and the results show that with _mm_clflush the first memory access is actually faster.

with _mm_clflush:

$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i =  0, cycles: 280
i =  1, cycles: 84
i =  2, cycles: 91
i =  3, cycles: 77
i =  4, cycles: 91

w/o _mm_clflush:

$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i =  0, cycles: 3899
i =  1, cycles: 91
i =  2, cycles: 105
i =  3, cycles: 77
i =  4, cycles: 84

It just does not make sense: I flush the cache line, yet the first access gets faster? Could anyone explain why this happens? Thanks

----------------Further experiment-------------------

Let's assume the 3899 cycles are caused by a TLB miss. To verify my understanding of cache hits and misses, I slightly modified this code to compare the memory access time for an L1 cache hit with that for an L1 cache miss.

This time, the code also steps through memory at the cache line size (64 bytes), so each access in the second loop lands in a different cache line.

*data = 1;
_mm_clflush(data);
printf("accessing 16 bytes in a cache line:\n");
for (i = 0; i < 16; i++) {
    t1 = __rdtscp(&junk);
    addr = &data[i];
    junk = *addr;
    t2 = __rdtscp(&junk) - t1;
    printf("i = %2d, cycles: %ld\n", i, t2);
}

// Invalidate and flush the cache line that contains data[0] from all levels of the cache hierarchy.
_mm_clflush(data);
printf("accessing 16 bytes in different cache lines:\n");
for (i = 0; i < 16; i++) {
    t1 = __rdtscp(&junk);
    addr = &data[i*LINE_SIZE];
    junk = *addr;
    t2 = __rdtscp(&junk) - t1;
    printf("i = %2d, cycles: %ld\n", i, t2);
}

My computer has an 8-way set-associative L1 data cache with 64 sets, 32 KB in total (8 ways × 64 sets × 64 B per line = 32 KB). If I access memory every 64 bytes, each access lands in a line that has never been loaded, so every access should be a cache miss. But it seems that a lot of these cache lines are already cached:

$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i =  0, cycles: 273
i =  1, cycles: 70
i =  2, cycles: 70
i =  3, cycles: 70
i =  4, cycles: 70
i =  5, cycles: 70
i =  6, cycles: 70
i =  7, cycles: 70
i =  8, cycles: 70
i =  9, cycles: 70
i = 10, cycles: 77
i = 11, cycles: 70
i = 12, cycles: 70
i = 13, cycles: 70
i = 14, cycles: 70
i = 15, cycles: 140
accessing 16 bytes in different cache lines:
i =  0, cycles: 301
i =  1, cycles: 133
i =  2, cycles: 70
i =  3, cycles: 70
i =  4, cycles: 147
i =  5, cycles: 56
i =  6, cycles: 70
i =  7, cycles: 63
i =  8, cycles: 70
i =  9, cycles: 63
i = 10, cycles: 70
i = 11, cycles: 112
i = 12, cycles: 147
i = 13, cycles: 119
i = 14, cycles: 56
i = 15, cycles: 105

Is this caused by the hardware prefetcher? Or is there something wrong with my understanding? Thanks

You can't measure with a granularity of a few cycles using rdtsc (even the rdtscp flavor); you're probably perturbing the measurement more than the load itself costs. You need to amortize over multiple lines. – Leeor
Yes, clflush does flush the cache line if it's present in any cache. See "clflush to invalidate cache line via C function" for a working program that measures cache-hit vs. L3-miss latency. – Peter Cordes
@Leeor Do you mean the measurement is inaccurate because of the cycles consumed by the rdtscp call itself? Actually, I am studying cache side channels; I guess the result could still be distinct enough to tell an L1 hit, an L2 hit, and a cache miss apart. – xiaogw
Thanks, @PeterCordes. I updated my question to look at the difference between a cache-line hit and a cache-line miss. The result is just a little difficult to explain. Could you answer that? – xiaogw
For a cache-read side channel, see "How can I create a spectre gadget in practice?". If I get around to it later, I'll read your whole question in more detail; I don't have time right now. – Peter Cordes

3 Answers

2
votes

I modified the code by adding a write before _mm_clflush(data), and it shows that clflush does flush the cache line. The modified code:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE   64
#define L1_LINES    512

// 32K memory for filling in L1 cache
uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    register uint64_t i;
    unsigned int junk = 0;
    register uint64_t t1, t2;

    data[0] = 1; // write before clflush
    //_mm_clflush(data);

    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2d, cycles: %ld\n", i, t2);
    }
}

I ran the modified code on my computer (Intel(R) Core(TM) i5-8500 CPU) and got the results below. Across several runs, the first access to data has a noticeably higher latency when the line has been flushed than when it has not.

without clflush:

data: 0000000000407980
accessing 16 bytes in a cache line:
i =  0, cycles: 64
i =  1, cycles: 46
i =  2, cycles: 49
i =  3, cycles: 48
i =  4, cycles: 46

with clflush:

data: 0000000000407980
accessing 16 bytes in a cache line:
i =  0, cycles: 214
i =  1, cycles: 41
i =  2, cycles: 40
i =  3, cycles: 42
i =  4, cycles: 40

1
votes

I guess it might be caused by a TLB miss at the beginning? And _mm_clflush actually brings the translation for this virtual address into the TLB; am I possibly right? How can I prove it?
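One way to test the TLB hypothesis (a minimal sketch of my own, not from the question, assuming data[0] and data[LINE_SIZE] fall in the same 4 KiB page): warm the translation by touching a different cache line in the same page, flush both lines, and then time the access to data[0]. If the timed access now costs a few hundred cycles instead of thousands, the huge first-access number was page-fault/translation overhead rather than an ordinary cache miss.

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE   64
#define L1_LINES    512

uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    unsigned int junk = 0;
    uint64_t t1, t2;

    // Touch a *different* line in (presumably) the same 4 KiB page:
    // this faults the page in and loads the TLB entry.
    addr = &data[LINE_SIZE];
    junk = *addr;

    // Make sure both lines are out of the cache before timing.
    _mm_clflush(&data[LINE_SIZE]);
    _mm_clflush(&data[0]);

    t1 = __rdtscp(&junk);
    addr = &data[0];
    junk = *addr;
    t2 = __rdtscp(&junk) - t1;

    // Warm TLB, cold line: expect an ordinary cache-miss latency here.
    printf("first access, warm TLB, cold line: %lu cycles\n", t2);
}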

1
votes

Without clflush, the first load takes about 3899 cycles, which is roughly the time it takes to handle a minor page fault. rdtscp serializes the load operations, thereby ensuring that all later loads to the same line hit in the L1 cache. Now when you add clflush just before the loop, the page fault is triggered and handled outside the loop. When the page fault handler returns and clflush gets re-executed, the target cache line gets flushed. On Intel processors, rdtscp ensures that the line is flushed before the first load in the loop is issued. Therefore, the first load misses in the cache hierarchy and its latency is about that of a memory access. Just like the previous case, later loads are serialized by rdtscp, so they all hit in the L1D.
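If you want to confirm that it really is a minor page fault, one option (a sketch assuming Linux, where getrusage exposes the minor-fault count in ru_minflt) is to read the process's fault counter immediately before and after the first touch of the array:

#include <stdio.h>
#include <stdint.h>
#include <sys/resource.h>

#define LINE_SIZE   64
#define L1_LINES    512

uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr = &data[0];
    volatile uint8_t junk;
    struct rusage before, after;

    getrusage(RUSAGE_SELF, &before);
    junk = *addr;                     // first touch of the BSS page
    getrusage(RUSAGE_SELF, &after);

    // A nonzero difference means the expensive first access coincided
    // with a minor page fault, not just a cache miss.
    printf("minor faults around first access: %ld\n",
           after.ru_minflt - before.ru_minflt);
}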

The measured L1D hit latencies are too high, though, even if we consider the overhead of rdtscp. Did you compile with -O3?

I wasn't able to reproduce your results (i.e., the minor page fault) with gcc 5.5.0 on Linux 4.4.0-154 when the cache line is allocated statically, but only when I use mmap. If you tell me your compiler version and kernel version, maybe I can investigate this further.
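For reference, this is roughly the mmap-based allocation I mean (a sketch only; the rdtscp timing loops stay exactly as in the question). An anonymous private mapping is populated lazily, so the first access to each page, or the clflush that targets it, takes a minor page fault unless you pass MAP_POPULATE:

#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

#define LINE_SIZE   64
#define L1_LINES    512

int main()
{
    // Pages are allocated lazily: the first touch of each 4 KiB page
    // (or a clflush aimed at it) triggers a minor page fault.
    uint8_t *data = mmap(NULL, L1_LINES*LINE_SIZE,
                         PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    printf("data: %p\n", data);
    // ... same rdtscp timing loops as in the question ...

    munmap(data, L1_LINES*LINE_SIZE);
}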

Regarding your second question, the way you're measuring load latency will not enable you to distinguish between an L1D hit and an L2 hit because the error in measurement could be as large as the difference in latencies. You can use the MEM_LOAD_UOPS_RETIRED.L1_HIT and MEM_LOAD_UOPS_RETIRED.L2_HIT performance counters to check. A sequential access pattern is pretty easily detectable by the L1 and L2 hardware prefetchers, so it's not surprising to get hits if you don't switch the prefetchers off.
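To illustrate the amortization idea from the comments (a rough sketch, not a calibrated benchmark): time one pass over many flushed lines with a single rdtscp pair and divide by the number of lines, so the fixed rdtscp overhead stops dominating each sample. Because the pass below is sequential, the hardware prefetchers will still hide much of the miss latency, which is exactly the effect described above; a randomized or pointer-chasing visit order, or the performance counters, would be needed to see the true per-line miss cost.

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE   64
#define L1_LINES    512

uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    unsigned int junk = 0;
    uint64_t t1, t2, i;

    // Flushing every line both faults the pages in and guarantees that
    // no line is cached when the timed pass starts.
    for (i = 0; i < L1_LINES; i++)
        _mm_clflush(&data[i*LINE_SIZE]);

    t1 = __rdtscp(&junk);
    for (i = 0; i < L1_LINES; i++) {
        addr = &data[i*LINE_SIZE];
        junk += *addr;
    }
    t2 = __rdtscp(&junk) - t1;

    printf("average cycles per line over %d lines: %lu\n",
           L1_LINES, t2 / L1_LINES);
}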