4 votes

Imagine I want to have one main thread and a helper thread run as the two hyperthreads on the same physical core (probably by forcing their affinity to approximately ensure this).

The main thread will be doing important high-IPC, CPU-bound work. The helper thread should do nothing other than periodically updating a shared timestamp value that the main thread will periodically read. The update frequency is configurable, but could be as fast as 100 MHz or more. Such fast updates more or less rule out a sleep-based approach, since blocking sleeps are far too slow to sleep/wake on a 10 nanosecond (100 MHz) period.

So I want a busy wait. However, the busy wait should be as friendly as possible to the main thread: use as few execution resources as possible, and so add as little overhead as possible to the main thread.

I guess the idea would be a long-latency instruction that doesn't use many execution resources, like `pause`, and that also has a fixed and known latency. That would let us calibrate the "sleep" period so no clock read is even needed: if we want to update with period P and the instruction has latency L, we just issue P/L of these instructions for a calibrated busy-sleep. Unfortunately `pause` doesn't meet that latter criterion, as its latency varies a lot¹.
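For illustration, the calibration idea might look like the sketch below. The function names are mine, and I'm substituting a serially dependent integer multiply through a `volatile` as the stand-in "fixed-latency instruction" (each iteration is a load/multiply/store chain, so its latency is roughly constant on a given core):

```c
#include <stdint.h>
#include <x86intrin.h>

/* Hypothetical sketch of the calibrated busy-sleep; names and the
 * choice of a serial multiply are assumptions, not from the question. */
static uint64_t spin_iters(uint64_t n) {
    volatile uint64_t x = 1;        /* volatile keeps the loop from being optimized away */
    for (uint64_t i = 0; i < n; i++)
        x *= 3;                     /* serial dependency: latency-bound, few uops */
    return x;
}

/* Measure cycles per iteration once at startup; afterwards a "sleep" of
 * P cycles is just spin_iters(P / cycles_per_iter) with no clock read. */
static double calibrate(void) {
    const uint64_t n = 1 << 20;
    uint64_t t0 = __rdtsc();
    spin_iters(n);
    return (double)(__rdtsc() - t0) / n;
}
```

A real version would presumably re-calibrate occasionally, since frequency scaling changes the core-clock-to-TSC ratio.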

A second option would be to use a long-latency instruction even if the latency is unknown, and after every instruction do an `rdtsc` or some other clock read (`clock_gettime`, etc.) to see how long we actually slept. Seems like it might slow down the main thread a lot, though.
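A minimal sketch of that second option, assuming GCC/Clang on x86; here `pause` is the cheap longish-latency instruction and `rdtsc` the clock read (the function name is illustrative):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc */
#include <immintrin.h>   /* _mm_pause */

/* Spin until the TSC reaches a deadline, issuing pause between reads so
 * the sibling hyperthread keeps most of the pipeline resources. */
static void sleep_until_tsc(uint64_t deadline) {
    while (__rdtsc() < deadline)
        _mm_pause();
}
```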

Any better options?


¹ Also, `pause` has some specific semantics around preventing speculative memory accesses, which may or may not be beneficial to this sibling-thread scenario, since I'm not really in a spin-wait loop.

Square roots maybe? Single µop, huge latency, but it would burn some power and affect square roots done by the main thread (does it do any?). And I wonder how intentionally mispredicted branches would act. – harold
I'm reasonably sure that it's only the FU that is blocked (if it's not pipelined), and the port is only used for issue/writeback. So e.g. a bunch of movd's is only slowed down slightly by the occasional sqrtsd. – harold
Experimentation is fine, and just having fun playing around is great, but for practical reasons as well as portability and compatibility, it's something I'd avoid. To answer your question: I have no idea what the timings are, but I'm willing to bet (quite heavily) it's negligible compared to anything else you do in your CPU-intensive task. If you have a tight inner loop, wrap it with a slower loop (1 : 10000 for example) and do your lookup there. That's my 2 cents anyway... – Amit
It would be interesting to play with FP denormals for mul or div, or x87 slowdowns with NaN/Infinity. Those have very high latency, but IDK if the "FP assist" generated to handle that is a lot of microcode uops, or if it's done inside the execution unit. Probably microcode, which makes it not particularly useful for a low-HT-interference thread. A movnti store / reload would use up memory resources instead of ALU ones. That's definitely going to be variable latency, so only usable with rdtsc (not calibration), but it will sleep for ~500 cycles, so it's pretty lightweight. – Peter Cordes
@PeterCordes - significant variability is more or less OK, especially if it is uncorrelated with what the main thread is doing. Imagine it is used for low-overhead method timing: variability is mostly OK if uncorrelated. It seems like at least some types of instructions wouldn't suffer too much competition, at least in some scenarios (e.g., floating-point instructions used to sleep while the main code is running integer code)... but something like Skylake pause really seems close to ideal. Pre-Skylake it's less clear. – BeeOnRope

1 Answer

1 vote

Some random musings on the subject.

So you want a timestamp updated at 100 MHz; that means that on a 4 GHz CPU you have 40 cycles between each update.

The timer thread busily reads the time stamp counter (`rdtsc`), but can't use the serializing method with `cpuid`, as that costs ~100 cycles. Plain `rdtsc` has a latency of around 25 cycles (and a throughput of about one per 25 cycles); there is also a slightly newer, partially ordered variant with slightly more latency (`rdtscp`, ~32 cycles).

    _Alignas(64) volatile uint64_t sample;   // own cache line: no other data disturbed
    const uint64_t CYCLES_PER_SAMPLE = 40;   // 100 MHz period on a 4 GHz CPU

    uint64_t last = __rdtsc();
    for (;;) {
        uint64_t time = __rdtsc();           // ~25 cycle read
        if (time - last < CYCLES_PER_SAMPLE)
            continue;                        // period not elapsed: keep spinning
        last += CYCLES_PER_SAMPLE;           // step by the period, not to 'time'
        sample = time;                       // publish for the main thread
    }

In a perfect world the branch predictor would guess right every time; in reality it will mispredict randomly, adding 5-14 cycles to the loop's ~26 cycles, due to the variance in the clock-read time.

When the sample is written, the other thread will have its speculatively executed instructions cancelled, starting from the first speculative load from this cache line (remember to align the sample to a 64-byte boundary so no other data is affected). The load of the sample timestamp then starts over, after a delay of ~5-14 cycles depending on where the instructions are fetched from: the loop buffer, the µop cache, or the I-cache.

So a minimum of 5-14 out of every 40 cycles of performance will be lost, in addition to half the core's resources being used by the other thread.

On the other hand, reading the time stamp counter in the main thread itself would cost...

~1/4 cycle amortized: the latency will most likely be hidden by surrounding instructions. But then you can't vary the frequency, and the ~25-cycle latency could be a problem unless some other long-latency instructions precede it.
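The raw cost of back-to-back `rdtsc` reads can be checked empirically; a rough sketch (GCC/Clang on x86 assumed; the loop and the name are mine, and the result varies by microarchitecture and can be much worse under some hypervisors):

```c
#include <stdint.h>
#include <x86intrin.h>

/* Average TSC ticks per back-to-back rdtsc (throughput, not latency). */
static double rdtsc_cost(void) {
    const int n = 1 << 16;
    uint64_t acc = 0;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < n; i++)
        acc += __rdtsc();           /* rdtsc can't be CSE'd: each call is distinct */
    double per = (double)(__rdtsc() - t0) / n;
    volatile uint64_t sink = acc;   /* keep the accumulator observable */
    (void)sink;
    return per;
}
```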

Using an atomic read-modify-write instruction (`lock xchg` or similar) might partly solve the problem, as the loads then shouldn't cause a re-issue of the instructions, but instead result in a delay on all following reads and writes.
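A sketch of publishing the sample that way using C11 atomics (on x86, `atomic_exchange` compiles to the implicitly locked `xchg`; the variable and function names are illustrative, and whether this actually helps here is speculative, as above):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Aligned to 64 bytes so the sample owns its cache line. */
_Alignas(64) static _Atomic uint64_t sample;

/* Publish the timestamp with an atomic exchange instead of a plain store. */
static void publish_sample(uint64_t now) {
    atomic_exchange_explicit(&sample, now, memory_order_release);
}
```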