I'm developing low-level binary search routines in C and x64 assembly, and I'm trying to measure the exact execution time of searches over uncached arrays (data in RAM). Searching the same array for different targets takes widely varying amounts of time depending on how "lucky" the branch prediction is. I can accurately measure the minimum and median execution times, but I'm finding it difficult to measure the maximum execution time.
The problem is that the worst-case scenario for branch prediction is comparable in time to an average-case scenario plus a processor interrupt. Both the worst-case scenario and the interrupt are rare, but I haven't come up with a good way to distinguish one rare event from the other. The standard approach is simply to filter out all "anomalously" high measurements, but that only works if there is a clear line between the two.
So the question becomes, "How can I distinguish between a measurement that was interrupted and one that legitimately took much longer than the rest?"
Or more generally, "How do I measure the full distribution of execution times without assuming in advance a hard maximum?"
Does the kernel store any information I could query about whether an interrupt has occurred? Is there something I could check before and after the measurement that would tell me whether it was interrupted? Ideally it would tell me how many cycles the interruption took, but just knowing that the measurement was affected would be a great start.
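For example, I could imagine polling the per-CPU interrupt totals that the kernel already exposes in /proc/interrupts before and after each timed search, and discarding any sample where the total changed. A rough sketch of that idea (the CPU number 3 stands in for the shielded core, and the parsing is my guess at the file layout; it only tells me that an interrupt happened, not how long it took):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sum the /proc/interrupts column for one CPU.  Call before and after a
     * measurement; if the total changed, the measurement was interrupted.
     * CPU 3 is a placeholder for the shielded NoHZ core. */
    static unsigned long long irq_count_for_cpu(int cpu)
    {
        FILE *f = fopen("/proc/interrupts", "r");
        if (!f) { perror("fopen"); exit(1); }

        char line[4096];                        /* long enough for a few CPUs */
        unsigned long long total = 0;

        fgets(line, sizeof line, f);            /* skip the "CPU0 CPU1 ..." header */
        while (fgets(line, sizeof line, f)) {
            char *p = strchr(line, ':');        /* rows look like "LOC:  12  34 ..." */
            if (!p)
                continue;
            p++;

            /* Walk to the column for the requested CPU. */
            char *end;
            unsigned long long val = 0;
            for (int col = 0; col <= cpu; col++) {
                val = strtoull(p, &end, 10);
                if (end == p) { val = 0; break; }   /* ERR:/MIS: rows have fewer columns */
                p = end;
            }
            total += val;
        }
        fclose(f);
        return total;
    }

Reading the file costs a couple of syscalls, but those would sit outside the RDTSC window, so they shouldn't distort the measurement itself.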
Maybe in addition to (or instead of) RDTSC, I can use RDPMC to read a counter that measures the number of cycles spent in Ring 0 (kernel) instead of Ring 3 (user)? Is there maybe a counter already set up to do this, or would I need to set up my own? Do I need to create my own kernel module to do this, or can I use existing ioctls?
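Something along these lines is what I have in mind, although the exclude_user trick is only my guess at how to apply it (untested sketch). perf_event_open() can program a counter of cycles spent outside ring 3, so a nonzero delta across a measurement would mean the kernel ran; reading it with read() costs a syscall, whereas mapping the event and using RDPMC directly would be cheaper, but that part is omitted here:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Open a per-thread counter of cycles spent outside user mode (ring 0). */
    static int open_kernel_cycle_counter(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.type         = PERF_TYPE_HARDWARE;
        attr.size         = sizeof attr;
        attr.config       = PERF_COUNT_HW_CPU_CYCLES;
        attr.exclude_user = 1;    /* count only kernel-mode cycles */
        attr.exclude_hv   = 1;
        attr.exclude_idle = 1;

        /* pid = 0, cpu = -1: this thread, on whichever CPU it runs. */
        int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd == -1) { perror("perf_event_open"); exit(1); }
        return fd;
    }

    static uint64_t read_counter(int fd)
    {
        uint64_t value;
        if (read(fd, &value, sizeof value) != sizeof value) {
            perror("read");
            exit(1);
        }
        return value;
    }

The idea would be to read the counter before and after each timed search and throw away (or separately bin) any sample where the delta is nonzero. Depending on the kernel, /proc/sys/kernel/perf_event_paranoid may need to be lowered (or the program run as root) to count kernel cycles.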
Some background:
I'm mostly running Ubuntu 14.04 with Linux 4.2.0 on an Intel Skylake i7-6700, but I'm also testing on Intel Sandy Bridge and Haswell. I've already done my best to reduce the jitter on the system as much as possible: I've recompiled a tickless kernel with CONFIG_NO_HZ_FULL, no forced preemption, transparent hugepage support off, and the timer frequency at 100 Hz.
I've stopped most unnecessary processes and removed most unnecessary kernel modules. I'm using cpuset / cset shield to reserve one of the NoHZ cores for a single process, and I've used kernel/debug/tracing to verify that I'm getting very few interrupts. But I'm still getting just enough that exact measurements are difficult. Maybe more importantly, I can envision future long-tail circumstances (for example, a hash table that rarely needs to be resized) where being able to distinguish between valid and invalid measurements would be very helpful.
I'm measuring execution time with RDTSC/RDTSCP using the techniques that Intel suggests in their whitepaper, and generally getting the accuracy I expect. My tests involve searching for 16-bit values, and I repeatedly and individually time each of the 65536 possible searches on random arrays of varying lengths. To prevent the processor from learning correct branch prediction, the searches are repeated in a different order each time. The searched array is removed from cache after each search with "CLFLUSH".
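For reference, each timed search looks roughly like the sketch below; the LFENCE serialization (in place of the whitepaper's CPUID) and the binary_search() stand-in are simplifications for illustration, not my exact code:

    #include <stdint.h>
    #include <stddef.h>
    #include <x86intrin.h>

    /* Stand-in for the routine under test. */
    extern size_t binary_search(const uint16_t *a, size_t n, uint16_t key);

    /* Evict the array between searches so the next one runs from RAM,
     * then time one search with the serialized RDTSC ... RDTSCP pattern. */
    static uint64_t time_one_search(const uint16_t *a, size_t n, uint16_t key)
    {
        unsigned aux;

        for (size_t i = 0; i < n; i += 64 / sizeof a[0])   /* 64-byte lines */
            _mm_clflush(&a[i]);
        _mm_mfence();                       /* wait for the flushes */

        _mm_lfence();                       /* serialize before reading TSC */
        uint64_t start = __rdtsc();

        volatile size_t result = binary_search(a, n, key);
        (void)result;

        uint64_t stop = __rdtscp(&aux);     /* waits for prior instructions */
        _mm_lfence();

        return stop - start;
    }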
This is a research project, and my goal is to learn about these issues. As such, I'm willing to try approaches that might otherwise be considered silly and extreme: custom kernel modules, protected mode x64 assembly, untested kernel modifications, and processor-specific features are all fair game. If there's a way to get rid of the few remaining interrupts so that all measurements are "real", that might be a viable solution as well. Thanks for any suggestions!
Comment from Mike Dunlavey:

    j=i+8; if (a[j] < key) i=j;

is one line in an unrolled list of statements, where in place of the number 8 there are descending powers of 2. Then for timing purposes, just run it 10^9 times and use a stopwatch.
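Spelled out for a 64-element sorted array (the length, the function name, and the return convention here are assumptions for illustration only), the unrolled search the comment describes looks like:

    #include <stdint.h>
    #include <stddef.h>

    /* Unrolled binary search over 64 sorted elements: one step per
     * descending power of two.  Leaves i at the largest index with
     * a[i] < key (i stays 0 if even a[0] >= key). */
    static size_t unrolled_search64(const uint16_t *a, uint16_t key)
    {
        size_t i = 0, j;
        j = i + 32; if (a[j] < key) i = j;
        j = i + 16; if (a[j] < key) i = j;
        j = i + 8;  if (a[j] < key) i = j;
        j = i + 4;  if (a[j] < key) i = j;
        j = i + 2;  if (a[j] < key) i = j;
        j = i + 1;  if (a[j] < key) i = j;
        return i;
    }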