A full discussion of the overhead you're seeing from the cpuid instruction is available at this stackoverflow thread. When using rdtsc, you need to use cpuid to ensure that no additional instructions are in the execution pipeline. The rdtscp instruction flushes the pipeline intrinsically. (The referenced SO thread also discusses these salient points, but I addressed them here because they're part of your question as well).
You only "need" to use cpuid+rdtsc if your processor does not support rdtscp. Otherwise, rdtscp is what you want, and will accurately give you the information you are after.
Both instructions provide you with a 64-bit, monotonically increasing counter that represents the number of cycles on the processor. If this is your pattern:
uint64_t s, e;
s = rdtscp();
do_interrupt();
e = rdtscp();
atomic_add(e - s, &acc);
atomic_add(1, &counter);
You may still have an off-by-one in your average measurement depending on where your read happens. For instance:
T1 T2
t0 atomic_add(e - s, &acc);
t1 a = atomic_read(&acc);
t2 c = atomic_read(&counter);
t3 atomic_add(1, &counter);
t4 avg = a / c;
It's unclear whether "[a]t the end" references a time that could race in this fashion. If so, you may want to calculate a running average or a moving average in-line with your delta.
Side-points:
- If you do use cpuid+rdtsc, you need to subtract out the cost of the cpuid instruction, which may be difficult to ascertain if you're in a VM (depending on how the VM implements this instruction). This is really why you should stick with rdtscp.
- Executing rdtscp inside a loop is usually a bad idea. I somewhat frequently see microbenchmarks that do things like
--
for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
s = rdtscp();
loop_body();
e = rdtscp();
acc += e - s;
}
printf("%"PRIu64"\n", (acc / SOME_LARGEISH_NUMBER / CLOCK_SPEED));
While this will give you a decent idea of the overall performance in cycles of whatever is in loop_body()
, it defeats processor optimizations such as pipelining. In microbenchmarks, the processor will do a pretty good job of branch prediction in the loop, so measuring the loop overhead is fine. Doing it the way shown above is also bad because you end up with 2 pipeline stalls per loop iteration. Thus:
s = rdtscp();
for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
loop_body();
}
e = rdtscp();
printf("%"PRIu64"\n", ((e-s) / SOME_LARGEISH_NUMBER / CLOCK_SPEED));
Will be more efficient and probably more accurate in terms of what you'll see in Real Life versus what the previous benchmark would tell you.
CPUID
implement some kind of full memory barrier? That would be noticeably expensive. – Kerrek SB