I have 3 memory blocks.
char block_a[1600]; // Initialized with random chars
unsigned short block_b[1600]; // Initialized with random shorts 0 - 1599 with no duplication
char block_c[1600]; // Initialized with 0
I am performing following copy operation on this
for ( int i = 0; i < 1600; i++ ) {
memcpy(block_c[i], block_a[block_b[i]], sizeof(block_a[0]); // Point # 1
}
Now I am trying to measure the CPU cycles + time in NS of above operation I am doing at Point # 1.
Measuring Environment1) Platform: Intel x86-64. Core i7
2) Linux Kernel 3.8
0) Implementation is done as kernel module so that I can have full control and precise data
1) Measured the overhead of CPUID + MOV instruction which I will use for serialization.
2) Disabled preemption + interrupts to get exclusive access of CPU
3) Called CPUID to make sure pipeline is clear of out-of-order instructions upto this point
4) Called RDTSC to get the initial value of TSC and saved this value
5) Performed the operation I want to measure which I have mentioned above
6) Called RDTSCP to get the end value of TSC and saved this value
7) Called CPUID again to make sure nothing gets inside our two RDTSC calls in out-of-order fashion
8) Subtracted end TSC value from start TSC value to get the CPU Cycles taken to perform this operation
9) Subtracted the overhead cycles taken by 2 MOVE instructions, to get the final CPU cycles.
....
....
preempt_disable(); /* Disable preemption to avoid scheduling */
raw_local_irq_save(flags); /* Disable the hard interrupts */
/* CPU is ours now */
__asm__ volatile (
"CPUID\n\t"
"RDTSC\n\t"
"MOV %%EDX, %0\n\t"
"MOV %%EAX, %1\n\t": "=r" (cycles_high_start), "=r" (cycles_low_start)::
"%rax", "%rbx", "%rcx", "%rdx"
);
/*
Measuring Point Start
*/
memcpy(&shuffled_byte_array[idx], &random_byte_array[random_byte_seed[idx]], sizeof(random_byte_array[0]));
/*
* Measuring Point End
*/
__asm__ volatile (
"RDTSCP\n\t"
"MOV %%EDX, %0\n\t"
"MOV %%EAX, %1\n\t"
"CPUID\n\t": "=r" (cycles_high_end), "=r" (cycles_low_end)::
"%rax", "%rbx", "%rcx", "%rdx"
);
/* Release CPU */
raw_local_irq_restore(flags);
preempt_enable();
start = ( ((uint64_t)cycles_high_start << 32) | cycles_low_start);
end = ( ((uint64_t)cycles_high_end << 32) | cycles_low_end);
if ( (end-start) >= overhead_cycles ) {
total = ( (end-start) - overhead_cycles);
} else {
// We will consdider last total
}
Question
The CPU cycles measurement I am getting does not seems to be realistic. Given are the results for some samples
Cycles Time(NS)
0006 0005
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0006 0005
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0011 0009
0000 0000
0011 0009
0006 0005
0006 0005
0006 0005
0011 0009
0006 0005
0000 0000
0011 0009
0011 0009
0006 0005
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0006 0005
0011 0009
0011 0009
0011 0009
0011 0009
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0011 0009
0011 0009
If I load my module again, give are the results.
Cycles Time(NS)
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0006 0005
0006 0005
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0011 0009
0011 0009
0011 0009
0011 0009
0011 0009
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0017 0014
0011 0009
0011 0009
0000 0000
0000 0000
0000 0000
0011 0009
0000 0000
0000 0000
0011 0009
0011 0009
0011 0009
0000 0000
0022 0018
0006 0005
0011 0009
0006 0005
0006 0005
0104 0086
0104 0086
0011 0009
0011 0009
0011 0009
0006 0005
0006 0005
0017 0014
0017 0014
0022 0018
0022 0018
0022 0018
0017 0014
0011 0009
0022 0018
0011 0009
0006 0005
0011 0009
0006 0005
0006 0005
0006 0005
0011 0009
0011 0009
0011 0009
0011 0009
0011 0009
0006 0005
0006 0005
0011 0009
0006 0005
0022 0018
0011 0009
0028 0023
0006 0005
0006 0005
0022 0018
0006 0005
0022 0018
0006 0005
0011 0009
0006 0005
0011 0009
0006 0005
0000 0000
0006 0005
0017 0014
0011 0009
0022 0018
0000 0000
0011 0009
0006 0005
0011 0009
0022 0018
0006 0005
0022 0018
0011 0009
0022 0018
0022 0018
0011 0009
0006 0005
0011 0009
0011 0009
0006 0005
0011 0009
0126 0105
0006 0005
0022 0018
0000 0000
0022 0018
0006 0005
0017 0014
0011 0009
0022 0018
0011 0009
0006 0005
0006 0005
0011 0009
In above list you will notice there are many copy operations for which I have got 0 CPU cycles. Many times I see < 3 cycles.
What do you think is the reason of getting 0 CPU cycles or very few for memcpy operation ? Any idea how much CPU cycles are taken by memcpy generally.
UpdateFollowing changes I have tried and getting given result
1) Cycle time 0 - 8 if I copy individual byte using memcpy after reboot
2) Cycle time 0, if I copy complete block using memcpy after reboot
3) BIOS changes to single core (though this code is already running on single core only, but just to make sure), no effect on results
4) BIOS changes to disabling Intel SpeedStep have no effect though once this problem is resolved, to get the maximum possible CPU cycles Intel SpeedStep should be disabled for the CPU to work in maximum frequency.
memcpy
has actually been optimized out? If you do not actually use the copied memory for something, aggressive optimizations might remove the calls completely. Also to consider for the second run is that your memory may have ended up in cache. – paddy