I did some profiling with the following setup: the test machine (AMD Athlon64 X2 3800+) was booted and switched to long mode (interrupts disabled), and the instruction of interest was executed in a loop, 100 iterations unrolled and 1,000 loop cycles, with the loop body aligned to 16 bytes. The time was measured with an rdtsc instruction before and after the loop. Additionally, a dummy loop without any instructions was measured (2 cycles per loop iteration and 14 cycles for the rest), and its result was subtracted from the result of the instruction profiling.
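Spelled out as arithmetic (my restatement; the setup above only describes it in words), with T_insn the raw cycle count of the instruction loop and T_dummy that of the empty loop:

cycles per instruction = (T_insn - T_dummy) / (LOOP_COUNT * UNROLLED_COUNT) = (T_insn - T_dummy) / 100,000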
The following instructions were measured:
- "lock cmpxchg [rsp - 8], rdx" (both with comparison match and mismatch; one possible setup is sketched below),
- "lock xadd [rsp - 8], rdx",
- "lock bts qword ptr [rsp - 8], 1".
In all cases the time measured was about 310 cycles; the error was about +/- 8 cycles.
This is the value for repeated execution on the same (cached) memory. With an additional cache miss, the times are considerably higher. Also, this was done with only one of the two cores active, so the cache was exclusively owned and no cache synchronisation was required.
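One possible setup for the cmpxchg match and mismatch cases (an illustration; not necessarily the exact code that was used): a successful comparison leaves rax unchanged, so the match case is stable across iterations, while a failed comparison loads rax from memory, so the mismatch case needs rax re-seeded inside the unrolled body:

// match case: initialize once before the loop
xor eax, eax
mov qword ptr [rsp - 8], rax    // memory = 0
xor edx, edx                    // rdx = 0, so a match stores the same value back
// unrolled body:
lock cmpxchg [rsp - 8], rdx     // rax == [rsp - 8] -> match every iteration

// mismatch case: memory stays 0, since no match ever stores rdx
mov rdx, 2
// unrolled body (the extra mov perturbs the measurement slightly):
mov rax, 1                      // rax != [rsp - 8] -> mismatch, rax reloaded from memory
lock cmpxchg [rsp - 8], rdx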
To evaluate the cost of a locked instruction on a cache miss, I added a wbinvd instruction (write back and invalidate all caches) before the locked instruction, and put a wbinvd plus an add [rsp - 8], rax into the comparison loop. In both cases the cost was about 80,000 cycles per instruction pair, dominated by the cache flush! In the case of lock bts, the difference between the locked and the unlocked loop was about 180 cycles per instruction.
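For concreteness, the two unrolled loop bodies being compared would look roughly like this (reconstructed from the description above, not copied from the original source):

// profiled body: flush and invalidate all caches, then the locked RMW hits cold memory
wbinvd
lock bts qword ptr [rsp - 8], 1

// comparison body: the same flush, followed by a plain unlocked RMW
wbinvd
add [rsp - 8], rax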
Note that this is the reciprocal throughput, but since locked operations are serializing, there is probably no difference from the latency.
Conclusion: a locked operation is heavy, but a cache miss can be much heavier.
Also: a locked operation does not cause cache misses; it can only cause cache synchronisation traffic when a cache line is not owned exclusively.
To boot the machine, I used an x64 version of FreeLdr from the ReactOS project. Here's the asm source code:
#define LOOP_COUNT 1000
#define UNROLLED_COUNT 100

PUBLIC ProfileDummy
ProfileDummy:
    cli
    // Get current TSC value into r8
    rdtsc
    mov r8, rdx
    shl r8, 32
    or r8, rax
    mov rcx, LOOP_COUNT
    jmp looper1
    .align 16
looper1:
    REPEAT UNROLLED_COUNT
        // nothing, or add something to compare against
    ENDR
    dec rcx
    jnz looper1
    // Put new TSC minus old TSC into rax
    rdtsc
    shl rdx, 32
    or rax, rdx
    sub rax, r8
    ret

PUBLIC ProfileFunction
ProfileFunction:
    cli
    // Get current TSC value into r8
    rdtsc
    mov r8, rdx
    shl r8, 32
    or r8, rax
    mov rcx, LOOP_COUNT
    jmp looper2
    .align 16
looper2:
    REPEAT UNROLLED_COUNT
        // Put here the code you want to profile
        // make sure it doesn't mess up non-volatiles or r8
        lock bts qword ptr [rsp - 8], 1
    ENDR
    dec rcx
    jnz looper2
    // Put new TSC minus old TSC into rax
    rdtsc
    shl rdx, 32
    or rax, rdx
    sub rax, r8
    ret
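A caller would then combine the two results as described at the top. This is a hypothetical sketch in the same dialect (the caller is not part of the original code; it assumes the two functions above are linked in):

// Hypothetical caller: compute cycles per profiled instruction
call ProfileDummy
mov r9, rax                             // r9 = cycles of the empty loop
call ProfileFunction
sub rax, r9                             // subtract the loop overhead
mov rcx, LOOP_COUNT * UNROLLED_COUNT    // 100,000 instruction executions
xor edx, edx
div rcx                                 // rax = cycles per instruction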