18
votes

Sometimes I encounter code that reads TSC with rdtsc instruction, but calls cpuid right before.

Why is calling cpuid necessary? I realize it may have something to do with different cores having TSC values, but what exactly happens when you call those two instructions in sequence?

3
In addition to paxdiablo's answer, note that even a single core like Pentium Pro, II and III can do out of order execution. Chapter 6 from Agner-Fog/microarchitecturetalekeDskobeDa

3 Answers

18
votes

It's to prevent out-of-order execution. From a link that has now disappeared from the web (but which was fortuitously copied here before it disappeared), this text is from an article entitled "Performance monitoring" by one John Eckerdal:

The Pentium Pro and Pentium II processors support out-of-order execution instructions may be executed in another order as you programmed them. This can be a source of errors if not taken care of.

To prevent this the programmer must serialize the the instruction queue. This can be done by inserting a serializing instruction like CPUID instruction before the RDTSC instruction.

6
votes

Two reasons:

  • As paxdiablo says, when the CPU sees a CPUID opcode it makes sure all the previous instructions are executed, then the CPUID taken, before any subsequent instructions execute. Without such an instruction, the CPU execution pipeline may end up executing TSC before the instruction(s) you'd like to time.
  • A significant proportion of machines fail to synchronise the TSC registers across cores. In you want to read it from a horse's mouth - knock yourself out at http://msdn.microsoft.com/en-us/library/ee417693%28VS.85%29.aspx. So, when measuring an interval between TSC readings, unless they're taken on the same core you'll have an effectively random but possibly constant (see below) interval introduced - it can easily be several seconds (yes seconds) even soon after bootup. This effectively reflects how long the BIOS was running on a single core before kicking off the others, plus - if you've any nasty power saving options on - increasing drift caused by cores running at different frequencies or shutting down again. So, if you haven't nailed the threads reading TSC registers to the same core then you'll need to build some kind of cross-core delta table and know the core id (which is returned by CPUID) of each TSC sample in order to compensate for this offset. That's another reason you can see CPUID alongside RDTSC, and indeed a reason why with newer RDTSCP many OSes are storing core id numbers into the extra TSC_AUX[31:0] data returned. (Available from Core i7 and Athlon 64 X2, RDTSCP is a much better option in all respects - the OS normally gives you the core id as mentioned, atomic to the TSC read, and prevent instruction reordering).
3
votes

CPUID is serializing, preventing out-of-order execution of RDTSC.

These days you can safely use LFENCE instead. It's documented as serializing on the instruction stream (but not stores to memory) on Intel CPUs, and now also on AMD after their microcode update for Spectre.

https://hadibrais.wordpress.com/2018/05/14/the-significance-of-the-x86-lfence-instruction/ explains more about LFENCE.

See also https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf for a way to use RDTSCP that keeps CPUID (or LFENCE) out of the timed region:

LFENCE     ; (or CPUID) Don't start the timed region until everything above has executed
RDTSC           ; EDX:EAX = timestamp
mov  ebx, eax   ; low 32 bits of start time

   code under test

RDTSCP     ; built-in one way barrier stops it from running early
LFENCE     ; (or CPUID) still use a barrier after to prevent anything weird
sub  eax, ebx   ; low 32 bits of end-start

See also Get CPU cycle count? for more about RDTSC caveats, like constant_tsc and nonstop_tsc.

As a bonus, RDTSCP gives you a core ID. You could use RDTSCP for the start time as well, if you want to check for core migration. But if your CPU has the constant_tsc features, all cores in the package should have their TSCs synced so you typically don't need this on modern x86.

You could get the core ID from CPUID instead, as @Tony's answer points out.