I have written the following code which first flushes two array elements and then tries to read elements in order to measure the hit/miss latencies.
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>
#include <time.h>
int main()
{
/* create array */
int array[ 100 ];
int i;
for ( i = 0; i < 100; i++ )
array[ i ] = i; // bring array to the cache
uint64_t t1, t2, ov, diff1, diff2, diff3;
/* flush the first cache line */
_mm_lfence();
_mm_clflush( &array[ 30 ] );
_mm_clflush( &array[ 70 ] );
_mm_lfence();
/* READ MISS 1 */
_mm_lfence(); // fence to keep load order
t1 = __rdtsc(); // set start time
_mm_lfence();
int tmp = array[ 30 ]; // read the first elemet => cache miss
_mm_lfence();
t2 = __rdtsc(); // set stop time
_mm_lfence();
diff1 = t2 - t1; // two fence statements are overhead
printf( "tmp is %d\ndiff1 is %lu\n", tmp, diff1 );
/* READ MISS 2 */
_mm_lfence(); // fence to keep load order
t1 = __rdtsc(); // set start time
_mm_lfence();
tmp = array[ 70 ]; // read the second elemet => cache miss (or hit due to prefetching?!)
_mm_lfence();
t2 = __rdtsc(); // set stop time
_mm_lfence();
diff2 = t2 - t1; // two fence statements are overhead
printf( "tmp is %d\ndiff2 is %lu\n", tmp, diff2 );
/* READ HIT*/
_mm_lfence(); // fence to keep load order
t1 = __rdtsc(); // set start time
_mm_lfence();
tmp = array[ 30 ]; // read the first elemet => cache hit
_mm_lfence();
t2 = __rdtsc(); // set stop time
_mm_lfence();
diff3 = t2 - t1; // two fence statements are overhead
printf( "tmp is %d\ndiff3 is %lu\n", tmp, diff3 );
/* measuring fence overhead */
_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
ov = t2 - t1;
printf( "lfence overhead is %lu\n", ov );
printf( "cache miss1 TSC is %lu\n", diff1-ov );
printf( "cache miss2 (or hit due to prefetching) TSC is %lu\n", diff2-ov );
printf( "cache hit TSC is %lu\n", diff3-ov );
return 0;
}
And output is
# gcc -O3 -o simple_flush simple_flush.c
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 529
tmp is 70
diff2 is 222
tmp is 30
diff3 is 46
lfence overhead is 32
cache miss1 TSC is 497
cache miss2 (or hit due to prefetching) TSC is 190
cache hit TSC is 14
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 486
tmp is 70
diff2 is 276
tmp is 30
diff3 is 46
lfence overhead is 32
cache miss1 TSC is 454
cache miss2 (or hit due to prefetching) TSC is 244
cache hit TSC is 14
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 848
tmp is 70
diff2 is 222
tmp is 30
diff3 is 46
lfence overhead is 34
cache miss1 TSC is 814
cache miss2 (or hit due to prefetching) TSC is 188
cache hit TSC is 12
There are some problems with the output for reading array[70]
. The TSC is neither hit nor miss. I had flushed that item similar to array[30]
. One possibility is that when array[40]
is accessed, the HW prefetcher brings array[70]
. So, that should be a hit. However, the TSC is much more than a hit. You can verify that the hit TSC is about 20 when I try to read array[30]
for the second time.
Even, if array[70]
is not prefetched, the TSC should be similar to a cache miss.
Is there any reason for that?
UPDATE1:
In order to make an array read, I tried (void) *((int*)array+i)
as suggested by Peter and Hadi.
In the output I see many negative results. I mean the overhead seems to be larger than (void) *((int*)array+i)
UPDATE2:
I forgot to add volatile
. The results are now meaningful.
volatile
and the value isn't used (optimiser would/should ignore it completely); and the cost of anlfence
depends on the surrounding code (e.g. how many loads were in flight at the time) and can't be measured under one set of conditions and assumed to be the same for a different set of conditions. – Brendanvolatile
. Thanks. – mahmood