Whiskey Lake i7-8565U
I'm trying to learn how to write benchmarks in a single shot by hands (without using any benchmarking frameworks) on an example of memory copy routine with regular and NonTemporal writes to WB memory and would like to ask for some sort of review.
Declaration:
void *avx_memcpy_forward_llss(void *restrict, const void *restrict, size_t);
void *avx_nt_memcpy_forward_llss(void *restrict, const void *restrict, size_t);
Definition:
avx_memcpy_forward_llss:
shr rdx, 0x3
xor rcx, rcx
avx_memcpy_forward_loop_llss:
vmovdqa ymm0, [rsi + 8*rcx]
vmovdqa ymm1, [rsi + 8*rcx + 0x20]
vmovdqa [rdi + rcx*8], ymm0
vmovdqa [rdi + rcx*8 + 0x20], ymm1
add rcx, 0x08
cmp rdx, rcx
ja avx_memcpy_forward_loop_llss
ret
avx_nt_memcpy_forward_llss:
shr rdx, 0x3
xor rcx, rcx
avx_nt_memcpy_forward_loop_llss:
vmovdqa ymm0, [rsi + 8*rcx]
vmovdqa ymm1, [rsi + 8*rcx + 0x20]
vmovntdq [rdi + rcx*8], ymm0
vmovntdq [rdi + rcx*8 + 0x20], ymm1
add rcx, 0x08
cmp rdx, rcx
ja avx_nt_memcpy_forward_loop_llss
ret
Benchmark code:
#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <immintrin.h>
#include <x86intrin.h>
#include "memcopy.h"
#define BUF_SIZE 128 * 1024 * 1024
_Alignas(64) char src[BUF_SIZE];
_Alignas(64) char dest[BUF_SIZE];
static inline void warmup(unsigned wa_iterations, void *(*copy_fn)(void *, const void *, size_t));
static inline void cache_flush(char *buf, size_t size);
static inline void generate_data(char *buf, size_t size);
uint64_t run_benchmark(unsigned wa_iteration, void *(*copy_fn)(void *, const void *, size_t)){
generate_data(src, sizeof src);
warmup(4, copy_fn);
cache_flush(src, sizeof src);
cache_flush(dest, sizeof dest);
__asm__ __volatile__("mov $0, %%rax\n cpuid":::"rax", "rbx", "rcx", "rdx", "memory");
uint64_t cycles_start = __rdpmc((1 << 30) + 1);
copy_fn(dest, src, sizeof src);
__asm__ __volatile__("lfence" ::: "memory");
uint64_t cycles_end = __rdpmc((1 << 30) + 1);
return cycles_end - cycles_start;
}
int main(void){
uint64_t single_shot_result = run_benchmark(1024, avx_memcpy_forward_llss);
printf("Core clock cycles = %" PRIu64 "\n", single_shot_result);
}
static inline void warmup(unsigned wa_iterations, void *(*copy_fn)(void *, const void *, size_t)){
while(wa_iterations --> 0){
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
}
}
static inline void generate_data(char *buf, size_t sz){
int fd = open("/dev/urandom", O_RDONLY);
read(fd, buf, sz);
}
static inline void cache_flush(char *buf, size_t sz){
for(size_t i = 0; i < sz; i+=_SC_LEVEL1_DCACHE_LINESIZE){
_mm_clflush(buf + i);
}
}
Results:
avx_memcpy_forward_llss median: 44479368 core cycles
UPD: time
real 0m0,217s
user 0m0,093s
sys 0m0,124s
avx_nt_memcpy_forward_llss median: 24053086 core cycles
UPD: time
real 0m0,184s
user 0m0,056s
sys 0m0,128s
UPD: The result was gotten when running the benchmark with taskset -c 1 ./bin
So I got almost almost 2 times difference in core cycles between the memory copy routine implementation. I interpret it as in case of regular stores to WB memory we have RFO requests competing on bus bandwidth as it is specified in IOM/3.6.12 (emphasize mine):
Although the data bandwidth of full 64-byte bus writes due to non-temporal stores is twice that of bus writes to WB memory, transferring 8-byte chunks wastes bus request bandwidth and delivers significantly lower data bandwidth.
QUESTION 1: How to do benchmark analysis in case of a single shot? Perf counters does not seem to be useful due to perf startup overhead and warmup iteration overhead.
QUESTION 2: Is such benchmark correct. I accounted cpuid in the beginning in order to start measuring with clean CPU resources to avoid stalls due to previous instruction in flight. I added memory clobbers as compile barrier and lfence to avoid rdpmc to be executed OoO.
copy_fninwarmup? This function enables you to remove the overhead of page faults that will occur when writing todest, but I'm not sure why you need to callcopy_fnso many times. Another thing,lfencewill not ensure that all previous stores have become globally observable. You can usemfenceinstead. Although on your processor, later NT loads from WC memory may passmfence(in contrast tolfence). Also I think you need a compiler barrier just after the first__rdpmc. - Hadi Braislfencein order to serialize it to be executed after rdpmc, but software must insert a serializing instruction (such as the CPUID instruction) before and/or after the RDPMC instruction solfenceseems to be not enough. Also I think you need a compiler barrier just after the first__rdpmcGood point. Why are there so many invocations of copy_fn in warmup? I thought it was required to warm uops cache up. But the amount of invocation was picked up at random. - St.Antario