3
votes

I have written the following code which first flushes two array elements and then tries to read elements in order to measure the hit/miss latencies.

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>
#include <time.h>
int main()
{
    /* create array */
    int array[ 100 ];
    int i;
    for ( i = 0; i < 100; i++ )
        array[ i ] = i;   // bring array to the cache

    uint64_t t1, t2, ov, diff1, diff2, diff3;

    /* flush the first cache line */
    _mm_lfence();
    _mm_clflush( &array[ 30 ] );
    _mm_clflush( &array[ 70 ] );
    _mm_lfence();

    /* READ MISS 1 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    int tmp = array[ 30 ];   // read the first elemet => cache miss
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff1 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff1 is %lu\n", tmp, diff1 );

    /* READ MISS 2 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    tmp = array[ 70 ];      // read the second elemet => cache miss (or hit due to prefetching?!)
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff2 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff2 is %lu\n", tmp, diff2 );


    /* READ HIT*/
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    tmp = array[ 30 ];   // read the first elemet => cache hit
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff3 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff3 is %lu\n", tmp, diff3 );


    /* measuring fence overhead */
    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    ov = t2 - t1;

    printf( "lfence overhead is %lu\n", ov );
    printf( "cache miss1 TSC is %lu\n", diff1-ov );
    printf( "cache miss2 (or hit due to prefetching) TSC is %lu\n", diff2-ov );
    printf( "cache hit TSC is %lu\n", diff3-ov );


    return 0;
}

And output is

# gcc -O3 -o simple_flush simple_flush.c
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 529
tmp is 70
diff2 is 222
tmp is 30
diff3 is 46
lfence overhead is 32
cache miss1 TSC is 497
cache miss2 (or hit due to prefetching) TSC is 190
cache hit TSC is 14
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 486
tmp is 70
diff2 is 276
tmp is 30
diff3 is 46
lfence overhead is 32
cache miss1 TSC is 454
cache miss2 (or hit due to prefetching) TSC is 244
cache hit TSC is 14
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 848
tmp is 70
diff2 is 222
tmp is 30
diff3 is 46
lfence overhead is 34
cache miss1 TSC is 814
cache miss2 (or hit due to prefetching) TSC is 188
cache hit TSC is 12

There are some problems with the output for reading array[70]. The TSC is neither hit nor miss. I had flushed that item similar to array[30]. One possibility is that when array[40] is accessed, the HW prefetcher brings array[70]. So, that should be a hit. However, the TSC is much more than a hit. You can verify that the hit TSC is about 20 when I try to read array[30] for the second time.

Even, if array[70] is not prefetched, the TSC should be similar to a cache miss.

Is there any reason for that?

UPDATE1:

In order to make an array read, I tried (void) *((int*)array+i) as suggested by Peter and Hadi.

In the output I see many negative results. I mean the overhead seems to be larger than (void) *((int*)array+i)

UPDATE2:

I forgot to add volatile. The results are now meaningful.

2
The compiler probably doesn't bother reading from the array because it's not volatile and the value isn't used (optimiser would/should ignore it completely); and the cost of an lfence depends on the surrounding code (e.g. how many loads were in flight at the time) and can't be measured under one set of conditions and assumed to be the same for a different set of conditions.Brendan
Yes. I forgot to add volatile. Thanks.mahmood

2 Answers

3
votes

First, note that the two calls to printf after measuring diff1 and diff2 may perturb the state of the L1D and even the L2. On my system, with printf, the reported values for diff3-ov range between 4-48 cycles (I've configured my system so that the TSC frequency is about equal to the core frequency). The most common values are those of the L2 and L3 latencies. If the reported value is 8, then we've got our L1D cache hit. If it is larger than 8, then most probably the preceding call to printf has kicked out the target cache line from the L1D and possibly the L2 (and in some rare cases, the L3!), which would explain the measured latencies that are higher than 8. @PeterCordes have suggested to use (void) *((volatile int*)array + i) instead of temp = array[i]; printf(temp). After making this change, my experiments show that most reported measurements for diff3-ov are exactly 8 cycles (which suggests that the measurement error is about 4 cycles), and the only other values that get reported are 0, 4, and 12. So Peter's approach is strongly recommended.

In general, the main memory access latency depends on many factors including the state of the MMU caches and the impact of the page table walkers on the data caches, the core frequency, the uncore frequency, the state and configuration of the memory controller and the memory chips with respect to the target physical address, uncore contention, and on-core contention due to hyperthreading. array[70] might be in a different virtual page (and physical page) than array[30] and their IPs of the load instructions and the addresses of the target memory locations may interact with the prefetchers in complex ways. So there can be many reasons why cache miss1 is different from cache miss2. A thorough investigation is possible, but it would require a lot of effort as you might imagine. Generally, if your core frequency is larger than 1.5 GHz (which is smaller than the TSC frequency on high-perf Intel processors), then an L3 load miss will take at least 60 core cycles. In your case, both miss latencies are over 100 cycles, so these are most likely L3 misses. In some extremely rare cases though, cache miss2 seems to be close to the L3 or L2 latency ranges, which would be due to prefetching.


I've determined that the following code gives a statistically more accurate measurement on Haswell:

t1 = __rdtscp(&dummy);
tmp = *((volatile int*)array + 30);
asm volatile ("add $1, %1\n\t"
              "add $1, %1\n\t"
              "add $1, %1\n\t"
              "add $1, %1\n\t"
              "add $1, %1\n\t"
              "add $1, %1\n\t"
              "add $1, %1\n\t"
              "add $1, %1\n\t"
              "add $1, %1\n\t"
              "add $1, %1\n\t"
              "add $1, %1\n\t"
          : "+r" (tmp));          
t2 = __rdtscp(&dummy);
t2 = __rdtscp(&dummy);
loadlatency = t2 - t1 - 60; // 60 is the overhead

The probability that loadlatency is 4 cycles is 97%. The probability that loadlatency is 8 cycles is 1.7%. The probability that loadlatency takes other values is 1.3%. All of the other values are larger than 8 and multiple of 4. I'll try to add an explanation later.

1
votes

Some ideas:

  • Perhaps a[70] was prefetched into some level of cache besides L1?
  • Perhaps some optimization in DRAM causes this access to be fast, for instance maybe the row buffer is left open after accessing a[30].

You should investigate other access besides a[30] and a[70] to see if you get different numbers. E.g. do you get the same timings for hit on a[30] followed by a[31] (which should be fetched in the same line as a[30], if you use aligned_alloc with 64 byte alignment). And do other elements like a[69] and a[71] give the same timings as a[70]?