Despite what Intel's current manuals say, masking the shift count was new in 186. For example, this CPU-detection code on reverse-engineering.SE uses that fact to distinguish 8086/88 from 80186/88. Perhaps Intel isn't counting 186 because it wasn't 100% IBM-PC compatible and was intended for embedded systems? Or Intel's current manual is just wrong; wouldn't be the first time.
This was a mostly arbitrary design decision during x86's evolution from simple micro-coded 8086 to 186, 286 and 386, but we can see some motivations. 386 had a barrel shifter (constant-time shifts), 186 and 286 didn't. IDK if the ISA design decision was nailed down before or after that HW design decision.
ARM chose differently and saturates shift counts instead of wrapping them. An ARM shift by the register width or more does zero the value.
And x86 SIMD shifts like pslld xmm0, 32
or pslld xmm1, xmm0
saturate the count; you can shift out all the bits of each element with MMX/SSE/AVX shifts, or on a per-element basis with AVX2 vpsllvd/q
which might be good if you're calculating a per-element shift count with c-192
, c-128
, c-64
, c
or something. OTOH AVX512VBMI2 VPSHRDVw/d/q
SIMD double-shift is does mask the count to the operand-size -1, making it impossible to have some elements shift all the way past the boundary and leave only bits from src2 in the destination element. As discussed below for 386 scalar shrd
, this would have required wider barrel shifters, or some special-casing of high counts.
186 / 286 had O(n) shifts/rotates (no barrel shifter) so masking limits the worst-case shift performance.
8086: SHL AX, CL
takes 8 clocks + 4 clocks per bit shifted. Worst-case for CL=255 is 1028 cycles. 286: 5 + n, worst case 5+31 = 36 cycles.
286 shift-count masking may also limit worst-case interrupt latency for multi-tasking systems if shifts can't abort mid-instruction and there aren't any even slower instructions. (286 introduced its version of protected mode so perhaps Intel was considering multi-user setups with a malicious unprivileged user trying to denial-of-service the system.) Or maybe the motivation was real code that accidentally(?) used large shift counts. Also, if shifts aren't fully microcoded, there's no need to make the count input any wider than 5 bits in dedicated shift hardware. Building a wider counter just so it can take longer isn't useful.
Update: masked counts being new in 186 rules out multi-user fairness, but could still avoid worst-case IRQ latency with software that lets large shift counts zero registers.
The 186 / 286 behaviour for 16-bit registers needed to maintain sufficient backwards compatibility with 8086 for existing software. This might be why the masking is to 5-bit counts (% 32
), not % 16
. (Not using % 16
or % 8
for 8-bit operand-size might also make the shift counter HW simpler, instead of muxing the high bit to 0 depending on operand size.)
Backwards compat is one of x86's main selling points. Presumably no widely used (on 8086) software depended on shift counts greater than 32 still zeroing a register, otherwise Intel might have saturated the count by checking all the high bits for zero and muxing with the result of a shifter that only used the low 4 bits.
But note that rotates use the same count-masking, so hypothetical hardware that detected high counts would have to avoid zeroing the result for rotates, and would still have to get FLAGS right for shifts by exactly 32, and for rotate-through-carry.
Another maybe-important reason for 16-bit 186 masking to % 32
is rotate-through-carry (rcl / rcr), which on 8086 can be meaningful with a count of 16. (Count mod 9 or 17 would be equivalent.) 32-bit rcl
can't rotate by 32, though; still masked to % 32
. But that's not a backwards-compat issue; rotate by 16 to 31 potentially is, if any code ever used RCL / RCR by more than 1 in the first place. (Definitely one of the more obscure instructions.)
So probably 186's cl % 32
design was compatible enough, and achieved the desired HW simplification / upper limit on cycles spent shifting.
186 was apparently intended for embedded use and had some integrated devices with addresses that conflicted with IBM-PC, so perhaps Intel felt like they could experiment with this change in 186 to see if it caused problems. Since it didn't(?), they kept it for 286? This is a totally made up guess based on a couple random facts extracted from comments from other people. I wasn't using PCs until Linux on a P-MMX Pentium and am only idly curious about this history, not a retrocomputing enthusiast. Speaking of which, you https://retrocomputing.stackexchange.com/ might be a good place to ask about this 186 design decision.
Why didn't 386 widen the count mask for wider shifts?
Why not have 386 still able to shift out all the bits with shl eax, 32
?
There was no existing software using 32-bit registers that 386 needed to be backwards compat with. 32-bit mode (and 32-bit operand size in 16-bit mode) was new with 386. So 386 could have chosen anything for 32-bit shifts. (But 8 and 16-bit shifts work exactly the same as in 186/286 to ensure compatibility.)
I don't know if Intel thought masked shift counts were actively useful as a feature or not. Masking to the same % 32
as 16-bit shifts was probably the easiest for them to implement, and is usable for 32-bit shifts.
386 had O(1) shifts with a barrel shifter, according to some random SO comments. Supporting larger shift counts would require a wider barrel shifter.
386 also introduced shld
/ shrd
double-precision shifts that shift in bits from another register, instead of 0 or copies of the sign bit. It would have been neat to be able to shift all the bits out and use shld eax, edx, 37
as a copy-and-shift with a false dependency. But supporting counts >= 32 for shl/rd would require a wider barrel shifter, not just a "zero the output on high bits set" check. For each output bit, the current design has 32 possible sources for that bit. Allowing wider counts would increase that to 64 possible sources for each result bit. As @Brendan shows, you can do a multi-step process instead of building a 32:1 muxer for each bit, but then you have more gate delays.
It would be inconsistent for SHLD / SHRD to treat their count differently from other shifts, and anything other than % 32
makes it harder to build.
I'm not sure this argument holds water: shld ax, dx, 25
would in theory do something, but Intel's current manual says If a count is greater than the operand size, the result is undefined. (I didn't test actual HW to see what happens.) Intel could simply have said the same thing for 32-bit shld/shrd in 386 if wider counts were allowed for other shifts.
Random thought: Rotate-through-carry is slow and micro-coded on modern CPUs for counts != 1. IDK if that would be another complication or not.