I'm developing low-level binary search routines in C and x64 assembly, and I'm trying to measure the exact execution time of searches over uncached arrays (data in RAM). Searching the same array for different targets takes widely varying amounts of time depending on how "lucky" the branch prediction is. I can accurately measure the minimum and median execution times, but I'm finding it difficult to measure the maximum execution time.
The problem is that the worst-case scenario for branch prediction is comparable in time to an average-case scenario plus a processor interrupt. Both the worst-case scenario and the interrupt are rare, but I haven't come up with a good way to distinguish one rare event from the other. The standard approach is simply to filter out all "anomalously" high measurements, but that only works if there is a clear line between the two.
So the question becomes, "How can I distinguish between a measurement that was interrupted and one that legitimately took much longer than the rest?"
Or more generally, "How do I measure the full distribution of execution times without assuming in advance a hard maximum?"
Does the kernel store any information I could query about whether an interrupt has occurred? Is there something I could check before and after the measurement that would tell me whether it was interrupted? Ideally it would tell me how many cycles the interruption took, but just knowing that the measurement was affected would be a great start.
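For example, I could imagine polling the per-CPU interrupt totals that the kernel already exposes in /proc/interrupts before and after each timed search, and discarding any sample where the total changed. A rough sketch of that idea (the CPU number 3 stands in for the shielded core, and the parsing is my guess at the file layout; it only tells me that an interrupt happened, not how long it took):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sum the /proc/interrupts column for one CPU.  Call before and after a
     * measurement; if the total changed, the measurement was interrupted.
     * CPU 3 is a placeholder for the shielded NoHZ core. */
    static unsigned long long irq_count_for_cpu(int cpu)
    {
        FILE *f = fopen("/proc/interrupts", "r");
        if (!f) { perror("fopen"); exit(1); }

        char line[4096];                        /* long enough for a few CPUs */
        unsigned long long total = 0;

        fgets(line, sizeof line, f);            /* skip the "CPU0 CPU1 ..." header */
        while (fgets(line, sizeof line, f)) {
            char *p = strchr(line, ':');        /* rows look like "LOC:  12  34 ..." */
            if (!p)
                continue;
            p++;

            /* Walk to the column for the requested CPU. */
            char *end;
            unsigned long long val = 0;
            for (int col = 0; col <= cpu; col++) {
                val = strtoull(p, &end, 10);
                if (end == p) { val = 0; break; }   /* ERR:/MIS: rows have fewer columns */
                p = end;
            }
            total += val;
        }
        fclose(f);
        return total;
    }

Reading the file costs a couple of syscalls, but those would sit outside the RDTSC window, so they shouldn't distort the measurement itself.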
Maybe in addition to (or instead of) RDTSC, I can use RDPMC to read a counter that measures the number of cycles spent in Ring 0 (kernel) instead of Ring 3 (user)? Is there maybe a counter already set up to do this, or would I need to set up my own? Do I need to create my own kernel module to do this, or can I use existing ioctls?
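Something along these lines is what I have in mind, although the exclude_user trick is only my guess at how to apply it (untested sketch). perf_event_open() can program a counter of cycles spent outside ring 3, so a nonzero delta across a measurement would mean the kernel ran; reading it with read() costs a syscall, whereas mapping the event and using RDPMC directly would be cheaper, but that part is omitted here:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Open a per-thread counter of cycles spent outside user mode (ring 0). */
    static int open_kernel_cycle_counter(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.type         = PERF_TYPE_HARDWARE;
        attr.size         = sizeof attr;
        attr.config       = PERF_COUNT_HW_CPU_CYCLES;
        attr.exclude_user = 1;    /* count only kernel-mode cycles */
        attr.exclude_hv   = 1;
        attr.exclude_idle = 1;

        /* pid = 0, cpu = -1: this thread, on whichever CPU it runs. */
        int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd == -1) { perror("perf_event_open"); exit(1); }
        return fd;
    }

    static uint64_t read_counter(int fd)
    {
        uint64_t value;
        if (read(fd, &value, sizeof value) != sizeof value) {
            perror("read");
            exit(1);
        }
        return value;
    }

The idea would be to read the counter before and after each timed search and throw away (or separately bin) any sample where the delta is nonzero. Depending on the kernel, /proc/sys/kernel/perf_event_paranoid may need to be lowered (or the program run as root) to count kernel cycles.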
Some background:
I'm mostly running Ubuntu 14.04 with Linux 4.2.0 on an Intel Skylake i7-6700, but I'm also testing on Intel Sandy Bridge and Haswell. I've already done my best to reduce the jitter on the system as much as possible: I've recompiled a tickless kernel with CONFIG_NO_HZ_FULL, no forced preemption, transparent hugepage support off, and the timer frequency at 100 Hz.
I've stopped most unnecessary processes and removed most unnecessary kernel modules. I'm using cpuset / cset shield to reserve one of the NoHZ cores for a single process, and I've used kernel/debug/tracing to verify that I'm getting very few interrupts. But I'm still getting just enough that exact measurements are difficult. Maybe more importantly, I can envision future long-tail circumstances (for example, a hash table that rarely needs to be resized) where being able to distinguish between valid and invalid measurements would be very helpful.
I'm measuring execution time with RDTSC/RDTSCP using the techniques that Intel suggests in their whitepaper, and generally getting the accuracy I expect. My tests involve searching for 16-bit values, and I repeatedly and individually time each of the 65536 possible searches on random arrays of varying lengths. To prevent the processor from learning correct branch prediction, the searches are repeated in a different order each time. The searched array is removed from cache after each search with "CLFLUSH".
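For reference, each timed search looks roughly like the sketch below; the LFENCE serialization (in place of the whitepaper's CPUID) and the binary_search() stand-in are simplifications for illustration, not my exact code:

    #include <stdint.h>
    #include <stddef.h>
    #include <x86intrin.h>

    /* Stand-in for the routine under test. */
    extern size_t binary_search(const uint16_t *a, size_t n, uint16_t key);

    /* Evict the array between searches so the next one runs from RAM,
     * then time one search with the serialized RDTSC ... RDTSCP pattern. */
    static uint64_t time_one_search(const uint16_t *a, size_t n, uint16_t key)
    {
        unsigned aux;

        for (size_t i = 0; i < n; i += 64 / sizeof a[0])   /* 64-byte lines */
            _mm_clflush(&a[i]);
        _mm_mfence();                       /* wait for the flushes */

        _mm_lfence();                       /* serialize before reading TSC */
        uint64_t start = __rdtsc();

        volatile size_t result = binary_search(a, n, key);
        (void)result;

        uint64_t stop = __rdtscp(&aux);     /* waits for prior instructions */
        _mm_lfence();

        return stop - start;
    }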
This is a research project, and my goal is to learn about these issues. As such, I'm willing to try approaches that might otherwise be considered silly and extreme: custom kernel modules, protected mode x64 assembly, untested kernel modifications, and processor-specific features are all fair game. If there's a way to get rid of the few remaining interrupts so that all measurements are "real", that might be a viable solution as well. Thanks for any suggestions!
Comment from Mike Dunlavey:

    j=i+8; if (a[j] < key) i=j;

is one line in an unrolled list of statements, where in place of the number 8 there are descending powers of 2. Then for timing purposes, just run it 10^9 times and use a stopwatch.
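Spelled out for a 64-element sorted array (the length, the function name, and the return convention here are assumptions for illustration only), the unrolled search the comment describes looks like:

    #include <stdint.h>
    #include <stddef.h>

    /* Unrolled binary search over 64 sorted elements: one step per
     * descending power of two.  Leaves i at the largest index with
     * a[i] < key (i stays 0 if even a[0] >= key). */
    static size_t unrolled_search64(const uint16_t *a, uint16_t key)
    {
        size_t i = 0, j;
        j = i + 32; if (a[j] < key) i = j;
        j = i + 16; if (a[j] < key) i = j;
        j = i + 8;  if (a[j] < key) i = j;
        j = i + 4;  if (a[j] < key) i = j;
        j = i + 2;  if (a[j] < key) i = j;
        j = i + 1;  if (a[j] < key) i = j;
        return i;
    }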