Imagine I want to have one main thread and a helper thread run as the two hyperthreads on the same physical core (probably by forcing their affinity to approximately ensure this).
The main thread will be doing important high-IPC, CPU-bound work. The helper thread should do nothing other than periodically update a shared timestamp value that the main thread will periodically read. The update frequency is configurable, but could be as fast as 100 MHz or more. Such fast updates more or less rule out a sleep-based approach, since blocking sleeps are too slow to sleep/wake on a 10 nanosecond (100 MHz) period.
So I want a busy wait. However, the busy wait should be as friendly as possible to the main thread: use as few execution resources as possible, and so add as little overhead as possible to the main thread.
I guess the idea would be a long-latency instruction that doesn't use many resources, like pause, and that also has a fixed-and-known latency. That would let us calibrate the "sleep" period so no clock read is even needed (if we want to update with period P and the instruction has latency L, we just issue P/L of these instructions for a calibrated busy-sleep). Well, pause doesn't meet that latter criterion, as its latency varies a lot¹.
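A minimal C sketch of the calibrated P/L idea, assuming the latency L of the delay instruction has already been measured by some other means. The names shared_ts, delay_count and helper_run are hypothetical, and cpu_relax wraps pause only as a stand-in for whatever fixed-latency instruction turns out to be suitable:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Stand-in delay instruction; its latency is NOT fixed across CPUs. */
static inline void cpu_relax(void) { _mm_pause(); }
#else
/* Fallback so the sketch compiles elsewhere: a plain compiler barrier. */
static inline void cpu_relax(void) { __asm__ __volatile__("" ::: "memory"); }
#endif

/* Hypothetical shared value: written by the helper, read by the main thread. */
static _Atomic uint64_t shared_ts;

/* Number of delay instructions to issue for period P given per-instruction
 * latency L (both in the same unit, e.g. core cycles). Rounds down, and
 * never returns less than 1. */
static inline uint64_t delay_count(uint64_t period, uint64_t latency)
{
    assert(latency > 0);
    uint64_t n = period / latency;
    return n ? n : 1;
}

/* Publish `updates` ticks, busy-sleeping n delay instructions before each.
 * No clock is read: the tick counter itself serves as the timestamp. */
static void helper_run(uint64_t n, uint64_t updates)
{
    for (uint64_t tick = 1; tick <= updates; tick++) {
        for (uint64_t i = 0; i < n; i++)
            cpu_relax();
        /* Single writer, so a relaxed store avoids a locked read-modify-write. */
        atomic_store_explicit(&shared_ts, tick, memory_order_relaxed);
    }
}
```

For example, with a period of 300 cycles and an assumed (not measured) latency of 140 cycles, delay_count(300, 140) gives 2 delay instructions per update. The main thread would read shared_ts with a relaxed atomic load.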
A second option would be to use a long-latency instruction even if the latency is unknown, and after every instruction do a rdtsc or some other clock reading method (clock_gettime, etc.) to see how long we actually slept. Seems like it might slow down the main thread a lot though.
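A rough C sketch of this second option, where the helper issues one delay instruction and then reads the clock until the deadline has passed. The names shared_ts, read_clock and helper_run_clocked are hypothetical, the period is expressed in TSC ticks on x86, and whether the per-iteration rdtsc hurts the sibling thread is exactly the open question, so treat this as a starting point for measurement rather than a recommendation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static inline uint64_t read_clock(void) { return __rdtsc(); } /* TSC ticks */
static inline void cpu_relax(void) { _mm_pause(); }
#else
#include <time.h>
/* Fallback so the sketch compiles elsewhere: wall-clock nanoseconds. */
static inline uint64_t read_clock(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}
static inline void cpu_relax(void) { __asm__ __volatile__("" ::: "memory"); }
#endif

/* Hypothetical shared value: written by the helper, read by the main thread. */
static _Atomic uint64_t shared_ts;

/* Publish `updates` timestamps, roughly one every `period` clock units.
 * The delay instruction's latency does not need to be known, because the
 * clock is re-read after every delay. */
static void helper_run_clocked(uint64_t period, uint64_t updates)
{
    assert(period > 0);
    uint64_t deadline = read_clock() + period;
    for (uint64_t i = 0; i < updates; i++) {
        uint64_t now;
        do {
            cpu_relax();              /* long-latency, low-resource delay */
            now = read_clock();       /* see how long we actually slept */
        } while ((int64_t)(now - deadline) < 0);
        atomic_store_explicit(&shared_ts, now, memory_order_relaxed);
        deadline += period;           /* fixed schedule, so errors do not accumulate */
    }
}
```

Advancing the deadline by a fixed period, rather than by now + period, keeps a single long delay from shifting every later update.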
Any better options?
¹ Also, pause has some specific semantics around preventing speculative memory accesses which may or may not be beneficial in this sibling-thread scenario, since I'm not really in a spin-wait loop.
movd's is only slowed down slightly by the occasional sqrtsd. – harold
movnti store / reload would use up memory resources instead of ALU. That's definitely going to be variable latency so only usable with rdtsc (not calibration), but will sleep for ~500 cycles, so it's pretty light-weight. – Peter Cordes
pause really seems close to ideal. Pre-Skylake it's less clear. – BeeOnRope