Your code looks correct, though you should run it several times and use the smallest value that comes up.
I think the question should be restated: what is the overhead of using rdtsc to count the clock cycles that elapse during a code sequence? The counting code is essentially (32-bit example):
rdtsc
mov dword ptr [mem64],eax
mov dword ptr [mem64+4],edx
; the code sequence to clock would go here when you're clocking it
rdtsc
sub eax,dword ptr [mem64]
sbb edx,dword ptr [mem64+4] ; sbb (subtract with borrow) carries the borrow from the low-dword sub into the high dword
With nothing between the two rdtsc instructions, the result left in edx:eax is the practical "rdtsc overhead" that is included whenever you time a code sequence this way.
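For anyone who would rather do this from C, here is a minimal sketch of the same measurement. It assumes GCC or Clang on x86, where `__rdtsc()` comes from `<x86intrin.h>`; the function names `rdtsc_overhead_once` and `min_rdtsc_overhead` are made up for illustration. It also folds in the advice above to repeat the measurement and keep the smallest value.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc(); assumes GCC or Clang targeting x86 */

/* One back-to-back measurement: nothing sits between the two reads,
   so the difference is the cost of the timing code itself. */
static uint64_t rdtsc_overhead_once(void)
{
    uint64_t start = __rdtsc();
    /* the code sequence to clock would go here when you're clocking it */
    uint64_t end = __rdtsc();
    return end - start;
}

/* Repeat the measurement and keep the smallest value seen, which is the
   run least disturbed by interrupts, cache misses and the like. */
static uint64_t min_rdtsc_overhead(int runs)
{
    assert(runs > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t = rdtsc_overhead_once();
        if (t < best)
            best = t;
    }
    return best;
}
```

Calling `min_rdtsc_overhead(1000)` gives a number you can subtract from later timings. The exact value depends on the CPU and on whether you are running inside a virtual machine, so treat it as something to measure on your own box rather than a constant.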
Once you have subtracted the rdtsc overhead, you still need to factor in pipelining and whether overlapping processing has completed. My own assumption is that if the timed sequence runs in fewer than perhaps 30 cycles, there may be uncompleted pipelined work that needs to be taken into account. If the sequence requires more than 100 cycles, there may still be such issues, but they can be ignored.
So what about between 30 and 100? It's definitely gray.
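To make that rule of thumb concrete, here is a small sketch in C. The 30 and 100 cycle cutoffs are just the rough personal figures from above, not hardware constants, and the names `net_cycles`, `classify_cycles` and the `timing_trust` enum are hypothetical.

```c
#include <stdint.h>

/* How far to trust a net cycle count, per the rough thresholds above. */
enum timing_trust { TIMING_SUSPECT, TIMING_GRAY, TIMING_USABLE };

/* Subtract the measured rdtsc overhead, clamping at zero in case a
   noisy run came in below the overhead figure. */
static uint64_t net_cycles(uint64_t raw, uint64_t overhead)
{
    return raw > overhead ? raw - overhead : 0;
}

/* Under ~30 cycles: pipelining effects likely matter.
   30 to 100: gray area. Over 100: effects exist but can be ignored. */
static enum timing_trust classify_cycles(uint64_t net)
{
    if (net < 30)
        return TIMING_SUSPECT;
    if (net <= 100)
        return TIMING_GRAY;
    return TIMING_USABLE;
}
```

For example, a raw reading of 130 with a measured overhead of 30 nets out to 100 cycles, which still lands in the gray band.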
rdtsc() is negligible. – Mysticial
rdtsc has already been measured. See instlatx64.atw.hu – harold