Mainstream Intel CPUs don't have any very long latency single-uop integer instructions. There are integer ALUs for 1-cycle latency uops on all ALU ports, and a 3-cycle-latency pipelined ALU on port 1. I think AMD is similar.
The div/sqrt unit is the only truly high-latency ALU, but integer div/idiv are microcoded on Intel so yes, use FP where div/sqrt are typically single-uop instructions.
AMD's integer div / idiv are 2-uop instructions (presumably to write the 2 outputs), with data-dependent latency.
Also, AMD Bulldozer/Piledriver (where 2 integer cores share a SIMD/FP unit) has pretty high latency for movd xmm, r32 (10c, 2 uops) and movd r32, xmm (8c, 1 uop). Steamroller shortens that by 1c each. Ryzen has 3-cycle latency, 1 uop, in either direction.
movd to/from XMM regs is cheap on Intel: single-uop with 1-cycle (Broadwell and earlier) or 2-cycle latency (Skylake). (See Agner Fog's instruction tables, https://agner.org/optimize/)
sqrtss has fixed latency (on IvB and later), other than maybe with subnormal inputs. If your chain-with-integer involves just movd xmm, r32 of an arbitrary integer bit-pattern, you might want to set DAZ/FTZ to remove the possibility of FP assists. NaN inputs are fine; that doesn't cause a slowdown for SSE/AVX math, only x87.
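For example, setting DAZ/FTZ from C could look something like this. It's only a sketch, assuming GCC or Clang on x86-64; the helper names enable_daz_ftz and sqrtss_of_bits are made up for illustration. The bit positions are the MXCSR ones (bit 15 = FTZ, bit 6 = DAZ).

```c
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _mm_sqrt_ss */
#include <emmintrin.h>  /* _mm_cvtsi32_si128, casts, _mm_cvtsi128_si32 */

/* MXCSR control bits: bit 15 = flush-to-zero, bit 6 = denormals-are-zero. */
#define MXCSR_FTZ 0x8000u
#define MXCSR_DAZ 0x0040u

/* Hypothetical helper: turn on FTZ and DAZ for the calling thread. */
void enable_daz_ftz(void)
{
    _mm_setcsr(_mm_getcsr() | MXCSR_FTZ | MXCSR_DAZ);
}

/* Hypothetical helper: movd xmm, r32 of an arbitrary bit-pattern,
   then sqrtss, then movd r32, xmm to get the result bits back. */
uint32_t sqrtss_of_bits(uint32_t bits)
{
    volatile uint32_t in = bits;  /* volatile: stop compile-time folding */
    __m128 v = _mm_castsi128_ps(_mm_cvtsi32_si128((int)in));
    volatile uint32_t out =
        (uint32_t)_mm_cvtsi128_si32(_mm_castps_si128(_mm_sqrt_ss(v)));
    return out;
}
```

With DAZ set, a subnormal bit-pattern like 0x00000001 is treated as +0.0 on input, so sqrtss produces +0.0 instead of possibly needing an assist. MXCSR is per-thread state, so this has to run on the thread that executes the chain.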
Other CPUs (Sandybridge and earlier, and all AMD) have variable-latency sqrtss, so you probably want to control the starting bit-pattern there.
Same goes if you want to use sqrtsd for higher latency per uop than sqrtss. sqrtsd is still variable-latency even on Skylake (15-16 cycles).
You can assume that the latency is a pure function of the input bit-pattern, so starting a chain of sqrtss instructions with the same input every time will give the same sequence of latencies. Or with a starting input of 0.0, 1.0, +inf, or NaN, you'll get the same latency for every uop in the sequence.
(Simple inputs like 1.0 and 0.0, with few significant figures in the input and output, presumably run with the lowest latency. sqrt(1.0) = 1.0 and sqrt(0) = 0, so these are self-perpetuating. Same for sqrt(+inf) = +inf and sqrt(NaN) = NaN.)
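To see the self-perpetuating inputs concretely, here's a small sketch of a dependent sqrtss chain using intrinsics (assuming GCC or Clang on x86-64; sqrtss_chain is a made-up name). It only demonstrates which bit-patterns are fixed points; it doesn't measure anything.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_sqrt_ss */
#include <emmintrin.h>  /* _mm_cvtsi32_si128, casts, _mm_cvtsi128_si32 */

/* Hypothetical helper: run n dependent sqrtss operations starting from
   the given float bit-pattern, and return the final bit-pattern. */
uint32_t sqrtss_chain(uint32_t start_bits, int n)
{
    volatile uint32_t in = start_bits;  /* volatile: stop constant folding */
    __m128 v = _mm_castsi128_ps(_mm_cvtsi32_si128((int)in));
    for (int i = 0; i < n; i++)
        v = _mm_sqrt_ss(v);  /* each sqrtss depends on the previous one */
    return (uint32_t)_mm_cvtsi128_si32(_mm_castps_si128(v));
}
```

Starting from 0x00000000 (0.0), 0x3F800000 (1.0) or 0x7F800000 (+inf), every step sees the same input, so every sqrtss in the chain has the same latency. Starting from something like 0x40000000 (2.0), the value changes at each step, so the latencies can differ along the chain on CPUs where sqrtss is variable-latency.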
You might use an instruction like and reg, 0 (or other non-dep-breaking zeroing) as part of your chain to control the input bit-pattern. Or perhaps use or reg, -1 to create NaN (all-ones is a NaN bit-pattern). Then you can get fixed latency on Sandybridge or earlier, and on AMD including Zen.
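One step of such a chain might be sketched like this with GNU C inline asm (an assumption: GCC or Clang targeting x86-64, AT&T syntax; step_via_zero and step_via_nan are made-up names). The inline asm is there because a compiler would otherwise fold x & 0 to a constant and drop the dependency on the old register value, which is the whole point.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_sqrt_ss */
#include <emmintrin.h>  /* _mm_cvtsi32_si128, casts, _mm_cvtsi128_si32 */

/* Hypothetical helper: integer -> XMM -> integer, forcing the input to 0.0.
   AND with 0 always produces 0 but still depends on the old register value. */
uint32_t step_via_zero(uint32_t x)
{
    __asm__ volatile("andl $0, %0" : "+r"(x) : : "cc");       /* and reg, 0   */
    __m128 v = _mm_castsi128_ps(_mm_cvtsi32_si128((int)x));   /* movd xmm,r32 */
    v = _mm_sqrt_ss(v);                                       /* sqrt(0.0)    */
    return (uint32_t)_mm_cvtsi128_si32(_mm_castps_si128(v));  /* movd r32,xmm */
}

/* Hypothetical helper: same idea, forcing the input to all-ones, a NaN. */
uint32_t step_via_nan(uint32_t x)
{
    __asm__ volatile("orl $-1, %0" : "+r"(x) : : "cc");       /* or reg, -1   */
    __m128 v = _mm_castsi128_ps(_mm_cvtsi32_si128((int)x));   /* movd xmm,r32 */
    v = _mm_sqrt_ss(v);                                       /* sqrt(NaN)    */
    return (uint32_t)_mm_cvtsi128_si32(_mm_castps_si128(v));  /* movd r32,xmm */
}
```

Feeding the return value back in as the next x gives a loop-carried dependency through the integer op, both movd directions and sqrtss. For actual timing you'd want to write the loop in asm so you know exactly which instructions are in it.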
Or perhaps pinsrw xmm0, eax, 7 (2 uops for port 5 on Intel) to only modify the high qword of an XMM, leaving the bottom as known 0.0 or 1.0. It's probably cheaper to just AND with 0 and use movd, unless port-5 pressure is a non-issue.
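As a sketch of what pinsrw with index 7 does to the register contents (assuming GCC or Clang on x86-64; the function names are made up), the intrinsic version below shows that only the top 16 bits change and the low float keeps its known value. It illustrates the data layout only: a compiler is free to optimize the insert away if the high word is never read, so it's not a substitute for writing the instruction in asm.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_set_ss */
#include <emmintrin.h>  /* _mm_insert_epi16, _mm_extract_epi16 */

/* Hypothetical helper: start with low float = 1.0f, insert the low 16 bits
   of x into word 7, and return the bit-pattern of the low float. */
uint32_t low_after_pinsrw(uint32_t x)
{
    __m128i v = _mm_castps_si128(_mm_set_ss(1.0f));  /* low = 1.0f, rest 0 */
    v = _mm_insert_epi16(v, (int)x, 7);              /* pinsrw xmm, r32, 7 */
    return (uint32_t)_mm_cvtsi128_si32(v);           /* low 32 bits        */
}

/* Hypothetical helper: same insert, but return word 7 to show where x went. */
uint32_t top_word_after_pinsrw(uint32_t x)
{
    __m128i v = _mm_castps_si128(_mm_set_ss(1.0f));
    v = _mm_insert_epi16(v, (int)x, 7);
    return (uint32_t)_mm_extract_epi16(v, 7) & 0xFFFFu;
}
```

Whatever x is, the low element stays 0x3F800000 (1.0), so a following sqrtss sees a fixed input while the XMM register as a whole still depends on the integer register.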
To create a throughput bottleneck (not latency), your best bet on Skylake is vsqrtpd ymm: 1 uop for p0, latency = 15-16 cycles, throughput = one per 9-12 cycles.
On Broadwell and earlier, it was 3 uops (2p0 p15), but I think Skylake widened the SIMD divider (in preparation for AVX512, I guess).
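A throughput-bound vsqrtpd ymm loop needs several independent dependency chains, so the divider always has work queued instead of waiting on one chain's latency. A rough sketch with intrinsics, assuming GCC or Clang (the target attribute and __builtin_cpu_supports are compiler extensions; have_avx and sqrt_4_chains are made-up names):

```c
#include <immintrin.h>  /* _mm256_set1_pd, _mm256_sqrt_pd, ... */

/* Hypothetical helper: nonzero if the CPU and OS support AVX. */
int have_avx(void)
{
    return __builtin_cpu_supports("avx") != 0;
}

/* Hypothetical helper: four independent vsqrtpd ymm chains, iters steps
   each. Returns the sum of lane 0 of the four results so the work can't
   be discarded. Only call this when have_avx() is nonzero. */
__attribute__((target("avx")))
double sqrt_4_chains(double s0, double s1, double s2, double s3, int iters)
{
    __m256d a = _mm256_set1_pd(s0);
    __m256d b = _mm256_set1_pd(s1);
    __m256d c = _mm256_set1_pd(s2);
    __m256d d = _mm256_set1_pd(s3);
    for (int i = 0; i < iters; i++) {
        a = _mm256_sqrt_pd(a);  /* the four sqrts in one iteration do not */
        b = _mm256_sqrt_pd(b);  /* depend on each other, only on their    */
        c = _mm256_sqrt_pd(c);  /* own chain's previous result            */
        d = _mm256_sqrt_pd(d);
    }
    __m256d sum = _mm256_add_pd(_mm256_add_pd(a, b), _mm256_add_pd(c, d));
    double out[4];
    _mm256_storeu_pd(out, sum);
    return out[0];
}
```

With a latency of 15-16 cycles and a throughput of one per 9-12 cycles, two chains would already be enough to keep the divider busy on Skylake; four just leaves some margin. As with the latency chains, check the compiler's asm output or write the loop by hand before trusting any timing from it.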