Benchmarking memory copy in a single shot

Question

Whiskey Lake i7-8565U

I'm trying to learn how to write benchmarks in a single shot by hands (without using any benchmarking frameworks) on an example of memory copy routine with regular and NonTemporal writes to WB memory and would like to ask for some sort of review.

Declaration:

void *avx_memcpy_forward_llss(void *restrict, const void *restrict, size_t);

void *avx_nt_memcpy_forward_llss(void *restrict, const void *restrict, size_t);

Definition:

avx_memcpy_forward_llss:
    shr rdx, 0x3
    xor rcx, rcx
avx_memcpy_forward_loop_llss:
    vmovdqa ymm0, [rsi + 8*rcx]
    vmovdqa ymm1, [rsi + 8*rcx + 0x20]
    vmovdqa [rdi + rcx*8], ymm0
    vmovdqa [rdi + rcx*8 + 0x20], ymm1
    add rcx, 0x08
    cmp rdx, rcx
    ja avx_memcpy_forward_loop_llss
    ret

avx_nt_memcpy_forward_llss:
    shr rdx, 0x3
    xor rcx, rcx
avx_nt_memcpy_forward_loop_llss:
    vmovdqa ymm0, [rsi + 8*rcx]
    vmovdqa ymm1, [rsi + 8*rcx + 0x20]
    vmovntdq [rdi + rcx*8], ymm0
    vmovntdq [rdi + rcx*8 + 0x20], ymm1
    add rcx, 0x08
    cmp rdx, rcx
    ja avx_nt_memcpy_forward_loop_llss
    ret

Benchmark code:

#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <immintrin.h>
#include <x86intrin.h>
#include "memcopy.h"

#define BUF_SIZE 128 * 1024 * 1024

_Alignas(64) char src[BUF_SIZE];
_Alignas(64) char dest[BUF_SIZE];

static inline void warmup(unsigned wa_iterations, void *(*copy_fn)(void *, const void *, size_t));
static inline void cache_flush(char *buf, size_t size);
static inline void generate_data(char *buf, size_t size);

uint64_t run_benchmark(unsigned wa_iteration, void *(*copy_fn)(void *, const void *, size_t)){
    generate_data(src, sizeof src);
    warmup(4, copy_fn); 
    cache_flush(src, sizeof src);
    cache_flush(dest, sizeof dest);
    __asm__ __volatile__("mov $0, %%rax\n cpuid":::"rax", "rbx", "rcx", "rdx", "memory"); 
    uint64_t cycles_start = __rdpmc((1 << 30) + 1); 
    copy_fn(dest, src, sizeof src); 
    __asm__ __volatile__("lfence" ::: "memory"); 
    uint64_t cycles_end = __rdpmc((1 << 30) + 1); 
    return cycles_end - cycles_start; 
}

int main(void){
    uint64_t single_shot_result = run_benchmark(1024, avx_memcpy_forward_llss);
    printf("Core clock cycles = %" PRIu64 "\n", single_shot_result);
}

static inline void warmup(unsigned wa_iterations, void *(*copy_fn)(void *, const void *, size_t)){
    while(wa_iterations --> 0){
        copy_fn(dest, src, sizeof src);
        copy_fn(dest, src, sizeof src);
        copy_fn(dest, src, sizeof src);
        copy_fn(dest, src, sizeof src);
        copy_fn(dest, src, sizeof src);
        copy_fn(dest, src, sizeof src);
        copy_fn(dest, src, sizeof src);
        copy_fn(dest, src, sizeof src);
    }
}

static inline void generate_data(char *buf, size_t sz){
    int fd = open("/dev/urandom", O_RDONLY);
    read(fd, buf, sz);
}

static inline void cache_flush(char *buf, size_t sz){
    for(size_t i = 0; i < sz; i+=_SC_LEVEL1_DCACHE_LINESIZE){
        _mm_clflush(buf + i);
    }
}

Results:

avx_memcpy_forward_llss median: 44479368 core cycles

UPD: time

real    0m0,217s
user    0m0,093s
sys     0m0,124s

avx_nt_memcpy_forward_llss median: 24053086 core cycles

UPD: time

real    0m0,184s
user    0m0,056s
sys     0m0,128s

UPD: The result was gotten when running the benchmark with taskset -c 1 ./bin

So I got almost almost 2 times difference in core cycles between the memory copy routine implementation. I interpret it as in case of regular stores to WB memory we have RFO requests competing on bus bandwidth as it is specified in IOM/3.6.12 (emphasize mine):

Although the data bandwidth of full 64-byte bus writes due to non-temporal stores is twice that of bus writes to WB memory, transferring 8-byte chunks wastes bus request bandwidth and delivers significantly lower data bandwidth.

QUESTION 1: How to do benchmark analysis in case of a single shot? Perf counters does not seem to be useful due to perf startup overhead and warmup iteration overhead.

QUESTION 2: Is such benchmark correct. I accounted cpuid in the beginning in order to start measuring with clean CPU resources to avoid stalls due to previous instruction in flight. I added memory clobbers as compile barrier and lfence to avoid rdpmc to be executed OoO.

Why are there so many invocations of copy_fn in warmup? This function enables you to remove the overhead of page faults that will occur when writing to dest, but I'm not sure why you need to call copy_fn so many times. Another thing, lfence will not ensure that all previous stores have become globally observable. You can use mfence instead. Although on your processor, later NT loads from WC memory may pass mfence (in contrast to lfence). Also I think you need a compiler barrier just after the first __rdpmc. — Hadi Brais
@HadiBrais Another thing, lfence will not ensure that all previous stores have become globally observable. Agree. I inserted lfence in order to serialize it to be executed after rdpmc, but software must insert a serializing instruction (such as the CPUID instruction) before and/or after the RDPMC instruction so lfence seems to be not enough. Also I think you need a compiler barrier just after the first __rdpmc Good point. Why are there so many invocations of copy_fn in warmup? I thought it was required to warm uops cache up. But the amount of invocation was picked up at random. — St.Antario
Oh, right. You can use the sequence mfence;lfence. There is really no need for a fully serializing instruction, but sure you can use cpuid. Anyway, this detail is only important when you want to do very fine-grained cycle accounting. — Hadi Brais

John D McCalpin John D McCalpin · Accepted Answer · 2020-02-12T17:30:12

Whenever possible, benchmarks should report results in ways that allow as much "sanity-checking" as possible. In this case, a few ways to enable such checks include:

For tests involving main memory bandwidth, results should be presented in units that allow direct comparison with the known peak DRAM bandwidth of the system. For a typical configuration of the Core i7-8565U, this is 2 channels * 8 Bytes/transfer * 2.4 billion transfers/sec = 38.4 GB/s (See also item (6), below.)
For tests that involve transfer of data anywhere in the memory hierarchy, the results should include a clear description of the size of the "memory footprint" (number of distinct cache line addresses accessed times the cache line size) and the number of repetitions of the transfer(s). Your code is easy to read here and the size is completely reasonable for a main memory test.
For any timed test, the absolute time should be included to enable comparison against plausible overheads of timing. Your use of only the CORE_CYCLES_UNHALTED counter makes it impossible to compute the elapsed time directly (though the test is clearly long enough that timing overheads are negligible).

Other important "best practice" principles:

Any test that employs RDPMC instructions must be bound to a single logical processor. Results should be presented in a way that confirms to the reader that such binding was employed. Common ways to enforce such binding in Linux include using the "taskset" or "numactl --physcpubind=[n]" commands, or including an inline call to "sched_setaffinity()" with a single allowed logical processor, or setting an environment variable that causes a runtime library (e.g., OpenMP) to bind the thread to a single logical processor.
When using hardware performance counters, extra care is needed to ensure that all of the configuration data for the counters is available and described correctly. The code above uses RDPMC to read IA32_PERF_FIXED_CTR1, which has an event name of CPU_CLK_UNHALTED. The modifier to the event name depends on the programming of IA32_FIXED_CTR_CTRL (MSR 0x38d) bits 7:4. There is no generally-accepted way of mapping from all possible control bits to event name modifiers, so it is best to provide the complete contents of IA32_FIXED_CTR_CTRL along with the results.
The CPU_CLK_UNHALTED performance counter event is the right one to use for benchmarks of portions of the processor whose behavior scales directly with processor core frequency -- such as instruction execution and data transfers involving only the L1 and L2 caches. Memory bandwidth involves portions of the processor whose performance does not scale directly with processor frequency. In particular, using CPU_CLK_UNHALTED without also forcing fixed-frequency operation makes it impossible to compute the elapsed time (required by (1) and (3) above). In your case, RDTSCP would have been easier than RDPMC -- RDTSC does not require the processes to be bound a single logical processor, it is not influenced by other configuration MSRs, and it allows direct computation of elapsed time in seconds.
Advanced: For tests involving transfer of data in the memory hierarchy, it is helpful to control for cache contents and the state (clean or dirty) of the cache contents, and to provide explicit descriptions of the "before" and "after" states along with the results. Given the sizes of your arrays, your code should completely fill all levels of the cache with some combination of portions of the source and destination arrays, and then flush all of those addresses, leaving a cache hierarchy that is (almost) completely full of invalid (clean) entries.
Advanced: Using CPUID as a serialization instruction is almost never useful in benchmarking. Although it guarantees ordering, it also takes a long time to execute -- Agner Fog's "Instruction Tables" report it at 100-250 cycles (presumably depending on the input arguments). (Update: Measurements over short intervals are always very tricky. The CPUID instruction has a long and variable execution time, and it is not clear what impact the microcoded implementation has on the internal state of the processor. It may be helpful in specific cases, but it should not be considered as something that is automatically included in benchmarks. For measurements over long intervals, out-of-order processing across the measurement boundaries is negligible, so CPUID is not needed.)
Advanced: Using LFENCE in benchmarks is only relevant if you are measuring at very fine granularity -- less than a few hundred cycles. More notes on this topic at http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/

If I assume that your processor was running at its maximum Turbo frequency of 4.6 GHz during the test, then the reported cycle counts correspond to 9.67 milliseconds and 5.23 milliseconds, respectively. Plugging these into a "sanity check" shows:

Assuming that the first case performs one read, one allocate, and one writeback (each 128MiB), the corresponding DRAM traffic rates are 27.8GB/s + 13.9 GB/s = 41.6 GB/s == 108% of peak.
Assuming that the second case performs one read and one streaming store (each 128MiB), the corresponding DRAM traffic rates are 25.7 GB/s + 25.7 GB/s = 51.3 GB/s = 134% of peak.

The failure of these "sanity checks" tells us that the frequency could not have been as high as 4.6 GHz (and was probably no higher than 3.0 GHz), but mostly just points to the need to measure the elapsed time unambiguously....

Your quote from the optimization manual on the inefficiency of streaming stores applies only to cases that cannot be coalesced into full cache line transfers. Your code stores to every element of the output cache lines following "best practice" recommendations (all store instructions writing to the same line are executed consecutively and generating only one stream of stores per loop). It is not possible to completely prevent the hardware from breaking up streaming stores, but in your case it should be extremely rare -- perhaps a few out of a million. Detecting partial streaming stores is a very advanced topic, requiring the use of poorly-documented performance counters in the "uncore" and/or indirect detection of partial streaming stores by looking for elevated DRAM CAS counts (which might be due to other causes). More notes on streaming stores are at http://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/

Benchmarking memory copy in a single shot

1 Answers