As a follow-up to this topic, in order to calculate the memory miss latency, I have wrote the following code using _mm_clflush
, __rdtsc
and _mm_lfence
(which is based on the code from this question/answer).
As you can see in the code, I first load the array into the cache. Then I flush one element and therefore the cache line is evicted from all cache levels. I put _mm_lfence
in order to preserve the order during -O3
.
Next, I used time stamp counter to calculate the latency or reading array[0]
. As you can see between two time stamps, there are three instructions: two lfence
and one read
. So, I have to subtract the lfence
overhead. The last section of the code calculates that overhead.
At the end of the code, the overhead and miss latency are printed. However, the result is not valid!
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>
int main()
{
int array[ 100 ];
for ( int i = 0; i < 100; i++ )
array[ i ] = i;
uint64_t t1, t2, ov, diff;
_mm_lfence();
_mm_clflush( &array[ 0 ] );
_mm_lfence();
_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
int tmp = array[ 0 ];
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
diff = t2 - t1;
printf( "diff is %lu\n", diff );
_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
ov = t2 - t1;
printf( "lfence overhead is %lu\n", ov );
printf( "miss cycles is %lu\n", diff-ov );
return 0;
}
However, the output is not valid
$ gcc -O3 -o flush1 flush1.c
$ taskset -c 0 ./flush1
diff is 161
lfence overhead is 147
miss cycles is 14
$ taskset -c 0 ./flush1
diff is 161
lfence overhead is 154
miss cycles is 7
$ taskset -c 0 ./flush1
diff is 147
lfence overhead is 154
miss cycles is 18446744073709551609
Any thought?
Next, I tried clock_gettime
function in order to calculate the miss latency as below
_mm_lfence();
_mm_clflush( &array[ 0 ] );
_mm_lfence();
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
_mm_lfence();
int tmp = array[ 0 ];
_mm_lfence();
clock_gettime(CLOCK_MONOTONIC, &end);
diff = 1000000000 * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
printf("miss elapsed time = %lu nanoseconds\n", diff);
The output is miss elapsed time = 578 nanoseconds
. Is that reliable?
UPDATE1:
Thanks to Peter and Hadi, to summarize the responses till now, I found out
1- Unused variables are omitted in the optimization phase and that was the reason on weird values I seen in the output. Thanks to Peter's reply, there are some ways to fix that.
2-
clock_gettime
is not suitable for such resolution and that function is used for larger delays.
As a workaround, I tried to bring the array in to the cache and then flush all elements to be sure that all elements are evicted from all cache levels. Then I measured the latency of array[0]
and then array[20]
. Since each element is 4-bytes, the distance is 80 bytes. I expect to get two cache misses. However, the latency of array[20]
is similar to a cache hit. A safe guess is that the cache line is not 80 bytes. So, maybe array[20]
is prefetched by hardware. Not always, but I also see some odd results again
for ( int i = 0; i < 100; i++ ) {
_mm_lfence();
_mm_clflush( &array[ i ] );
_mm_lfence();
}
_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
int tmp = array[ 0 ];
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
diff1 = t2 - t1;
printf( "tmp is %d\ndiff1 is %lu\n", tmp, diff1 );
_mm_lfence();
t1 = __rdtsc();
tmp = array[ 20 ];
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
diff2 = t2 - t1;
printf( "tmp is %d\ndiff2 is %lu\n", tmp, diff2 );
_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
ov = t2 - t1;
printf( "lfence overhead is %lu\n", ov );
printf( "TSC1 is %lu\n", diff1-ov );
printf( "TSC2 is %lu\n", diff2-ov );
Output is
$ ./flush1
tmp is 0
diff1 is 371
tmp is 20
diff2 is 280
lfence overhead is 147
TSC1 is 224
TSC2 is 133
$ ./flush1
tmp is 0
diff1 is 399
tmp is 20
diff2 is 280
lfence overhead is 154
TSC1 is 245
TSC2 is 126
$ ./flush1
tmp is 0
diff1 is 392
tmp is 20
diff2 is 840
lfence overhead is 147
TSC1 is 245
TSC2 is 693
$ ./flush1
tmp is 0
diff1 is 364
tmp is 20
diff2 is 140
lfence overhead is 154
TSC1 is 210
TSC2 is 18446744073709551602
The statement that "HW prefetcher brings other blocks" is about 80% correct then. What is the going on then? Any more accurate statement?
mov
instructions after the firstrdtsc
but before the 2nd, to save the results in registers that won't be overwritten. Or a shift + OR + mov, or shift + LEA. Get CPU cycle count?. Since you compiled with optimization enabled, that's probably all (and will run in parallel with the load), but don't make the mistake of forgetting talking about C like it's actual asm. – Peter Cordesint tmp = array[ 0 ]
becausetmp
is not used. Check the generated assembly code. Also, make sure you fix the core frequency. – Hadi Braisclock_gettime
cannot be used to measure the perf of such tiny pieces of code because its own overhead can be relatively massive. Also it only works at the nanoseconds granularity. – Hadi Brais