The time stamp counter was introduced on the Pentium microarchitecture. Out-of-order execution didn't show up until the Pentium Pro. Intel could have made rdtsc serializing (architecturally or internally), but it seems that they decided to keep it non-serializing, which is OK for general-purpose time measurements, and leave it up to the programmer to add serializing instructions if necessary. This is good for reducing the overhead of the measurement.
That's actually confirmed in the document you provide, with the following comment about Pentium and Pentium/MMX (in 4.2, slightly paraphrased):
All of the rules and code samples described in section 4.1 (Pentium Pro and Pentium II) also apply to the Pentium and Pentium/MMX. The only difference is, the CPUID instruction is not necessary for serialization.
And, from Wikipedia:
The Time Stamp Counter is a 64-bit register present on all x86 processors since the Pentium.
: : :
Starting with the Pentium Pro, Intel processors have supported out-of-order execution, where instructions are not necessarily performed in the order they appear in the executable. This can cause RDTSC to be executed later than expected, producing a misleading cycle count.
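As a rough illustration of "leaving it up to the programmer", a CPUID-then-RDTSC read might look like the sketch below. It assumes GCC or Clang on x86 (for x86intrin.h and the inline-asm syntax); serialize_with_cpuid and rdtsc_after_cpuid are illustrative names, not something taken from the Intel document.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc (GCC/Clang, x86 only) */

/* Execute CPUID purely for its serializing side effect.
   The register outputs are discarded. */
static inline void serialize_with_cpuid(void)
{
    unsigned int eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
}

/* Illustrative helper: the TSC is sampled only after CPUID has
   forced all earlier instructions to complete. */
static inline uint64_t rdtsc_after_cpuid(void)
{
    serialize_with_cpuid();
    return __rdtsc();
}
```

On an in-order Pentium the call to serialize_with_cpuid would be unnecessary, which is exactly the difference the quoted section 4.2 describes. CPUID is also slow, and in virtual machines it usually traps to the hypervisor, so this form adds noticeable overhead to the measurement.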
One of the two uses of RDTSCP is to give you the processor ID in addition to the time stamp information (it's right there in the name: Read Time-Stamp Counter *AND* Processor ID), which is useful on systems with unsynced TSCs across cores or sockets (see: How to get the CPU cycle count in x86_64 from C++?). The additional ordering guarantee of rdtscp (it waits for earlier instructions to execute before reading the counter) makes it more convenient at the end of the region of interest (see: Is there any difference in between (rdtsc + lfence + rdtsc) and (rdtsc + rdtscp) in measuring execution time?).
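A minimal sketch of that pattern, with rdtsc at the start and rdtscp at the end of the region, could look like this. It assumes GCC or Clang on x86 with the x86intrin.h intrinsics; the struct, the function names, and the spin_work workload are all made up for illustration. The trailing lfence is there because rdtscp only waits for earlier instructions and does not hold back later ones.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86 only) */

/* Result of one timed region (illustrative type). */
struct tsc_sample {
    uint64_t ticks;    /* TSC ticks between start and stop */
    unsigned int aux;  /* IA32_TSC_AUX value returned by RDTSCP */
};

static inline uint64_t tsc_start(void)
{
    _mm_lfence();               /* let earlier instructions finish first */
    uint64_t t = __rdtsc();
    _mm_lfence();               /* keep the timed code from starting early */
    return t;
}

static inline uint64_t tsc_stop(unsigned int *aux)
{
    uint64_t t = __rdtscp(aux); /* waits for earlier instructions to execute */
    _mm_lfence();               /* keep later instructions behind the read */
    return t;
}

/* Time a single call to fn and report the processor ID seen at the end. */
static inline struct tsc_sample time_region(void (*fn)(void))
{
    struct tsc_sample s;
    uint64_t t0 = tsc_start();
    fn();
    uint64_t t1 = tsc_stop(&s.aux);
    s.ticks = t1 - t0;
    return s;
}

/* Hypothetical workload, only here so the sketch has something to time. */
static void spin_work(void)
{
    volatile unsigned int x = 0;
    for (unsigned int i = 0; i < 1000; i++)
        x += i;
}
```

Calling time_region(spin_work) returns the elapsed TSC ticks plus the aux value. What aux contains is up to the operating system (Linux typically stores the CPU and node numbers there), so comparing it with a value read at the start is one way to notice that the thread migrated mid-measurement.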
RDTSCP isn't serializing the way CPUID is. It's only a one-way barrier for instructions, and doesn't stop later instructions from executing before it (and other earlier instructions). – Peter Cordes