
I have 3 memory blocks.

char block_a[1600]; // Initialized with random chars
unsigned short block_b[1600]; // Initialized with random shorts 0 - 1599 with no duplication
char block_c[1600]; // Initialized with 0
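Filling `block_b` with the shorts 0 - 1599 with no duplication amounts to generating a random permutation. A minimal user-space sketch of one way to do that, using Fisher-Yates (the function name is mine, and a kernel module would use `get_random_bytes()` rather than `rand()`):

```c
#include <stdlib.h>

#define BLOCK_LEN 1600

/* Fill perm[0..n-1] with the values 0..n-1, then Fisher-Yates shuffle
 * so every index appears exactly once, in random order. */
static void fill_permutation(unsigned short perm[], unsigned int n)
{
    unsigned int i;

    for (i = 0; i < n; i++)
        perm[i] = (unsigned short)i;

    for (i = n - 1; i > 0; i--) {
        unsigned int j = (unsigned int)rand() % (i + 1); /* 0 <= j <= i */
        unsigned short tmp = perm[i];
        perm[i] = perm[j];
        perm[j] = tmp;
    }
}
```

Because every index appears exactly once, each byte of `block_a` gets copied exactly once by the loop above, just to a shuffled position.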

I am performing the following copy operation on these blocks:

for ( int i = 0; i < 1600; i++ ) {
    memcpy(&block_c[i], &block_a[block_b[i]], sizeof(block_a[0])); // Point # 1
}

Now I am trying to measure the CPU cycles and the time in nanoseconds taken by the operation at Point # 1.

Measuring Environment

1) Platform: Intel x86-64. Core i7
2) Linux Kernel 3.8

Measuring Algorithm

0) The implementation is done as a kernel module so that I have full control and precise data.
1) Measured the overhead of the CPUID + MOV instructions which I use for serialization.
2) Disabled preemption + interrupts to get exclusive access to the CPU.
3) Called CPUID to make sure the pipeline is clear of out-of-order instructions up to this point.
4) Called RDTSC to get the initial value of the TSC and saved it.
5) Performed the operation I want to measure (Point # 1 above).
6) Called RDTSCP to get the end value of the TSC and saved it.
7) Called CPUID again to make sure nothing slips between our two TSC reads in out-of-order fashion.
8) Subtracted the start TSC value from the end TSC value to get the CPU cycles taken by the operation.
9) Subtracted the overhead cycles measured in step 1 to get the final CPU cycle count.

Code
    ....
    ....
    preempt_disable(); /* Disable preemption to avoid scheduling */
    raw_local_irq_save(flags); /* Disable the hard interrupts */
    /* CPU is ours now */
    __asm__ volatile (
        "CPUID\n\t"
        "RDTSC\n\t"
        "MOV %%EDX, %0\n\t"
        "MOV %%EAX, %1\n\t": "=r" (cycles_high_start), "=r" (cycles_low_start)::
        "%rax", "%rbx", "%rcx", "%rdx"
    );

    /*
     Measuring Point Start
    */
    memcpy(&shuffled_byte_array[idx], &random_byte_array[random_byte_seed[idx]], sizeof(random_byte_array[0]));
    /* 
    * Measuring Point End
    */
    __asm__ volatile (
        "RDTSCP\n\t"
        "MOV %%EDX, %0\n\t"
        "MOV %%EAX, %1\n\t"
        "CPUID\n\t": "=r" (cycles_high_end), "=r" (cycles_low_end)::
        "%rax", "%rbx", "%rcx", "%rdx"
    );

    /* Release CPU */
    raw_local_irq_restore(flags);
    preempt_enable();

    start = ( ((uint64_t)cycles_high_start << 32) | cycles_low_start);
    end   = ( ((uint64_t)cycles_high_end << 32) | cycles_low_end);
    if ( (end - start) >= overhead_cycles ) {
        total = (end - start) - overhead_cycles;
    } else {
        /* Delta is smaller than the measured overhead; keep the last total */
    }
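The recombination of EDX:EAX and the guarded subtraction at the end can be factored into two small helpers (a sketch; the names are mine), which makes the underflow guard explicit:

```c
#include <stdint.h>

/* RDTSC/RDTSCP return the 64-bit TSC split across EDX (high half)
 * and EAX (low half); rebuild the full counter value. */
static uint64_t tsc_value(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}

/* Net cycles for the measured region: raw delta minus the calibrated
 * timing overhead. Returns 0 when the delta is smaller than the
 * overhead, i.e. the sample is below the measurement floor. */
static uint64_t net_cycles(uint64_t start, uint64_t end, uint64_t overhead)
{
    uint64_t delta = end - start;
    return (delta >= overhead) ? (delta - overhead) : 0;
}
```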
Question

The CPU cycle measurements I am getting do not seem realistic. Here are the results for some samples:

Cycles Time(NS)
0006 0005
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0006 0005
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0011 0009
0000 0000
0011 0009
0006 0005
0006 0005
0006 0005
0011 0009
0006 0005
0000 0000
0011 0009
0011 0009
0006 0005
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0006 0005
0011 0009
0011 0009
0011 0009
0011 0009
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0011 0009
0011 0009

If I load my module again, these are the results:

Cycles Time(NS)
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0006 0005
0006 0005
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0011 0009
0011 0009
0011 0009
0011 0009
0011 0009
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0017 0014
0011 0009
0011 0009
0000 0000
0000 0000
0000 0000
0011 0009
0000 0000
0000 0000
0011 0009
0011 0009
0011 0009
0000 0000
0022 0018
0006 0005
0011 0009
0006 0005
0006 0005
0104 0086
0104 0086
0011 0009
0011 0009
0011 0009
0006 0005
0006 0005
0017 0014
0017 0014
0022 0018
0022 0018
0022 0018
0017 0014
0011 0009
0022 0018
0011 0009
0006 0005
0011 0009
0006 0005
0006 0005
0006 0005
0011 0009
0011 0009
0011 0009
0011 0009
0011 0009
0006 0005
0006 0005
0011 0009
0006 0005
0022 0018
0011 0009
0028 0023
0006 0005
0006 0005
0022 0018
0006 0005
0022 0018
0006 0005
0011 0009
0006 0005
0011 0009
0006 0005
0000 0000
0006 0005
0017 0014
0011 0009
0022 0018
0000 0000
0011 0009
0006 0005
0011 0009
0022 0018
0006 0005
0022 0018
0011 0009
0022 0018
0022 0018
0011 0009
0006 0005
0011 0009
0011 0009
0006 0005
0011 0009
0126 0105
0006 0005
0022 0018
0000 0000
0022 0018
0006 0005
0017 0014
0011 0009
0022 0018
0011 0009
0006 0005
0006 0005
0011 0009

In the above lists you will notice there are many copy operations for which I got 0 CPU cycles, and many more where I see fewer than 3 cycles.

What do you think is the reason for getting 0 (or very few) CPU cycles for a memcpy operation? And roughly how many CPU cycles does a memcpy generally take?

Update

I have tried the following changes, with the given results:

1) Cycle count is 0 - 8 if I copy individual bytes using memcpy after a reboot.
2) Cycle count is 0 if I copy the complete block using memcpy after a reboot.
3) A BIOS change to a single core (this code already runs on a single core, but just to make sure) has no effect on the results.
4) A BIOS change disabling Intel SpeedStep has no effect either, though once this problem is resolved SpeedStep should stay disabled so that the CPU runs at its maximum frequency and yields the maximum cycle counts.

Have you tried generating the assembly output from your compiler and seeing whether the memcpy has actually been optimized out? If you do not actually use the copied memory for something, aggressive optimizations might remove the calls completely. Also consider, for the second run, that your memory may have ended up in cache. – paddy
Thanks paddy. I haven't checked the assembly output of memcpy but I will check this out. The second point you have mentioned seems a probable reason so far. – Farrukh Arshad
You can have a go at flushing your cache out if you want to test/validate that second point. Nothing concrete in this question and answer set for all architectures, but you could definitely knock up something for your system. – paddy
Thanks paddy, I have used the CLFLUSH instruction to clear my cache and the results look promising, but the measurement method should also account for cache performance. – Farrukh Arshad

1 Answer


It looks like the cache was the reason for the seemingly incorrect CPU cycle counts (actually the counts were not incorrect, but cache behavior has to be accounted for in this case to get meaningful results). After making sure the cache is clear of the given data, my results look fine. I added the following function to clear the cache. The clflush function is available in the kernel APIs and uses the x86 CLFLUSH instruction.

static void flush_cache(char random_byte_array[], char shuffled_byte_array[])
{
    unsigned int idx = 0;
    /* Evict both buffers from all cache levels, one 64-byte line at a time */
    for ( idx = 0; idx < (MEM_BLOCK_SIZE / 64); idx++ ) {
        clflush(random_byte_array + (idx * 64));
    }
    for ( idx = 0; idx < (MEM_BLOCK_SIZE / 64); idx++ ) {
        clflush(shuffled_byte_array + (idx * 64));
    }
}

memcpy on the full memory block of 1600 bytes:
CPU Cycles = 216 - 260 (for multiple tests)
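As a sanity check on the full-block number: 1600 bytes span 25 cache lines of 64 bytes, so 216 - 260 cycles works out to roughly 8 - 10 cycles per line once the lines are being streamed in. A quick check of that arithmetic (helper name is mine):

```c
/* Average cycles spent per 64-byte cache line when copying a block of
 * the given size (x86 cache lines are 64 bytes; integer division). */
static unsigned int cycles_per_line(unsigned int total_cycles,
                                    unsigned int block_size)
{
    unsigned int lines = block_size / 64; /* 1600 / 64 = 25 lines */
    return total_cycles / lines;
}
```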

memcpy of individual bytes of the 1600-byte block:

Cycles Time (ns)
0159 0132
0000 0000
0000 0000
....
....
0049 0040
0049 0040
0049 0040
0000 0000
0000 0000
....
....

For the memcpy of the first element (0th index) it takes around 140 - 160 cycles; for the next several elements it takes 0 - 10 cycles (I guess because the data is now loaded in the cache); after some more elements it again takes around 140 - 160 cycles (probably a cache miss occurred).

As long as the data is not in the cache I get plausible CPU cycle counts, but whenever the data is in the cache the copy is too fast to measure this way; cache behavior should be accounted for in the measurements.