3
votes

I'm digging into left and right shift operations in x86 ASM, like shl eax, cl

From the IA-32 Intel Architecture Software Developer’s Manual, Vol. 3:

All IA-32 processors (starting with the Intel 286 processor) do mask the shift count to 5 bits, resulting in a maximum count of 31. This masking is done in all operating modes (including the virtual-8086 mode) to reduce the maximum execution time of the instructions.

I'm trying to understand the reasoning behind this logic. Maybe it works this way because, at the hardware level, it is hard to implement a shift across all 32 (or 64) bits of a register in one cycle?

Any detailed explanation would help a lot!

4
It can shift all bits in a single cycle. A 5-bit shift count allows a shift value of 31, which is the maximum useful shift for a 32-bit register. (Anything larger would always result in 0.) 64-bit processors use a 6-bit shift count, to allow shifting up to 63 bits. – prl
@prl Thank you! Yes, that is perfectly clear. But what is the reasoning behind this restriction? Maybe I want to shift by 32 bits to get 0; for me as a developer that is the logical, naturally expected behaviour. So the question arises: is it a hardware problem, or just some kind of decision like "we decided it would be more consistent and logical to implement it that way, with a maximum shift of 31 bits"? – No Name QA

4 Answers

5
votes

Edited to correct statement re: 80386, which (to my surprise) did have a barrel shifter.


Happy to hear the 286 described as "modern" :-)

The 8086 ran a SHL AX, CL in 8 clocks + 4 clocks per bit shifted, so if CL = 255 this is a seriously slow instruction!

So the 286 did everybody a favour and limited the count by masking it to 0..31, capping the instruction at 5 + 31 clocks, which for 16-bit registers is an interesting compromise.

[I found "80186/80188 80C186/80C188 Hardware Reference Manual" (order no. 270788-001) which says that this innovation appears there first. SHL et al ran 5+n clocks (for register operations), same like the 286. FWIW, the 186 also added PUSHA/POPA, PUSH immed., INS/OUTS, BOUND, ENTER/LEAVE, INUL immed. and SHL/ROL etc. immed. I do not know why the 186 appears to be a non-person.]

For the 386 they kept the same mask, which now also applies to 32-bit register shifts. I found a copy of the "80386 Programmer's Reference Manual" (order no. 230985-001), which gives a clock count of 3 for all register shifts. The "Intel 80386 Hardware Reference Manual" (order no. 231732-002), section 2.4 "Execution Unit", says that the Execution Unit includes:

• The Data Unit contains the ALU, a file of eight 32-bit general-purpose registers, and a 64-bit barrel shifter (which performs multiple bit shifts in one clock).

So, I do not know why they did not mask 32-bit shifts to 0..63. At this point I can only suggest the cock-up theory of history.

I agree it is a shame that there isn't a (GPR) shift which returns zero for any count >= the operand size. That would require the hardware to check for any bit set beyond the bottom 6/5 bits of the count and return zero; as a compromise, perhaps checking just bit 6/bit 5.

[I haven't tried it, but I suspect that using PSLLQ et al is hard work -- shuffling count and value to xmm and shuffling the result back again -- compared to testing the shift count and masking the result of a shift in some branch-free fashion.]
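
For what it's worth, the branch-free scalar workaround is short. Here is a minimal C sketch (the helper name is mine, not from the answer): do the hardware-masked shift, then build an all-ones/all-zeros mask from a comparison on the count and AND it into the result.

#include <stdint.h>

/* Hypothetical helper: a left shift that returns 0 for any count >= 32,
   layered on top of the masked hardware shift. */
static inline uint32_t shl32_zero_on_overshift(uint32_t x, unsigned count)
{
    uint32_t shifted = x << (count & 31);          /* what shl actually computes */
    uint32_t keep    = (uint32_t)0 - (count < 32); /* all-ones if count < 32, else 0 */
    return shifted & keep;                         /* zero the result for count >= 32 */
}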

Anyway... the reason for the behaviour appears to be history.

2
votes

In electronics, if the shift count is constant you can shift by doing nothing (it's like connecting the wire for "input bit 0" to the wire for "output bit 1", etc.).

You can break a variable shift count into multiple "shift with constant count" operations, ending up with something vaguely like:

if( (count & 1) != 0) { v = v << 1; }
if( (count & 2) != 0) { v = v << 2; }
if( (count & 4) != 0) { v = v << 4; }
if( (count & 8) != 0) { v = v << 8; }
if( (count & 16) != 0) { v = v << 16; }

Of course these conditions become nothing too (it's more like "bit 0 of the count is an enable/disable flag for the circuit that does a constant shift by 1"). The problem is that each "shift by constant" depends on the value of the previous "shift by constant", so you can't start "step N+1" until "step N" completes. That synchronization between steps takes time, so more steps (supporting larger counts) makes it slower. Counts larger than the number of bits in a register are rare, and you don't really want to make common cases slower to support rare cases.
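
Gluing those fragments into a complete function makes the cost argument concrete: a 32-bit shifter needs five stages in series, and allowing counts up to 63 would add a sixth stage, i.e. one more mux delay on the critical path. A rough C sketch, assuming the staged structure above:

#include <stdint.h>

/* C model of the staged shifter: each bit of the count enables one constant-shift
   stage, and each stage feeds the next. The count is effectively taken mod 32. */
static inline uint32_t staged_shl32(uint32_t v, unsigned count)
{
    if (count & 1)  v <<= 1;
    if (count & 2)  v <<= 2;
    if (count & 4)  v <<= 4;
    if (count & 8)  v <<= 8;
    if (count & 16) v <<= 16;
    /* a 64-bit shifter would add a sixth stage here: if (count & 32) v <<= 32; */
    return v;
}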

2
votes

Despite what Intel's current manuals say, masking the shift count was new in 186. For example, this CPU-detection code on reverse-engineering.SE uses that fact to distinguish 8086/88 from 80186/88. Perhaps Intel isn't counting 186 because it wasn't 100% IBM-PC compatible and was intended for embedded systems? Or Intel's current manual is just wrong; wouldn't be the first time.


This was a mostly arbitrary design decision during x86's evolution from simple micro-coded 8086 to 186, 286 and 386, but we can see some motivations. 386 had a barrel shifter (constant-time shifts), 186 and 286 didn't. IDK if the ISA design decision was nailed down before or after that HW design decision.

ARM chose differently and saturates shift counts instead of wrapping them. An ARM shift by the register width or more does zero the value.

And x86 SIMD shifts like pslld xmm0, 32 or pslld xmm1, xmm0 saturate the count; you can shift out all the bits of each element with MMX/SSE/AVX shifts, or on a per-element basis with AVX2 vpsllvd/q, which might be good if you're calculating per-element shift counts like c-192, c-128, c-64, c or something. OTOH, the AVX512VBMI2 VPSHRDV w/d/q SIMD double-shifts do mask the count to operand-size - 1, making it impossible to have some elements shift all the way past the boundary and leave only bits from src2 in the destination element. As discussed below for 386 scalar shrd, allowing that would have required wider barrel shifters, or some special-casing of high counts.
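
The saturating SIMD behaviour is easy to demonstrate with SSE2 intrinsics; this is just an illustrative sketch, not something from the original discussion:

#include <stdint.h>
#include <stdio.h>
#include <emmintrin.h>   /* SSE2 */

int main(void)
{
    __m128i v = _mm_set1_epi32(0x12345678);
    /* pslld with a register count of 32: the count saturates, so every element
       becomes 0, unlike scalar shl where 32 & 31 == 0 leaves the value unchanged. */
    __m128i r = _mm_sll_epi32(v, _mm_cvtsi32_si128(32));
    printf("0x%08x\n", (uint32_t)_mm_cvtsi128_si32(r));   /* prints 0x00000000 */
    return 0;
}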


186 / 286 had O(n) shifts/rotates (no barrel shifter) so masking limits the worst-case shift performance.

8086: SHL AX, CL takes 8 clocks + 4 clocks per bit shifted. Worst-case for CL=255 is 1028 cycles. 286: 5 + n, worst case 5+31 = 36 cycles.

286 shift-count masking may also limit worst-case interrupt latency for multi-tasking systems if shifts can't abort mid-instruction and there aren't any even slower instructions. (286 introduced its version of protected mode so perhaps Intel was considering multi-user setups with a malicious unprivileged user trying to denial-of-service the system.) Or maybe the motivation was real code that accidentally(?) used large shift counts. Also, if shifts aren't fully microcoded, there's no need to make the count input any wider than 5 bits in dedicated shift hardware. Building a wider counter just so it can take longer isn't useful.

Update: masked counts being new in 186 rules out the multi-user fairness motive, but masking could still avoid worst-case IRQ latency when running software that used large shift counts expecting them to zero a register.

The 186 / 286 behaviour for 16-bit registers needed to maintain sufficient backwards compatibility with 8086 for existing software. This might be why the masking is to 5-bit counts (% 32), not % 16. (Not using % 16 or % 8 for 8-bit operand-size might also make the shift counter HW simpler, instead of muxing the high bit to 0 depending on operand size.)

Backwards compat is one of x86's main selling points. Presumably no widely used (on 8086) software depended on shift counts of 32 or more still zeroing a register; otherwise Intel might have saturated the count instead, by checking the high bits of the count and muxing in zero instead of the result of a shifter that only uses the low 4 bits.

But note that rotates use the same count-masking, so hypothetical hardware that detected high counts would have to avoid zeroing the result for rotates, and would still have to get FLAGS right for shifts by exactly 32, and for rotate-through-carry.

Another maybe-important reason for 16-bit 186 masking to % 32 is rotate-through-carry (rcl / rcr), which on 8086 can be meaningful with a count of 16. (Count mod 9 or 17 would be equivalent.) 32-bit rcl can't rotate by 32, though; still masked to % 32. But that's not a backwards-compat issue; rotate by 16 to 31 potentially is, if any code ever used RCL / RCR by more than 1 in the first place. (Definitely one of the more obscure instructions.)
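
To see why counts up to 16 are meaningful for 16-bit rcl: the carry flag and the register rotate together as a 17-bit quantity. A hypothetical C model of the semantics (not how the hardware is built):

#include <stdint.h>

/* Model of 16-bit RCL semantics: CF and the register form a 17-bit value that
   rotates as a unit, so counts are meaningful mod 17 (mod 9 for 8-bit operands). */
static inline uint16_t rcl16(uint16_t v, unsigned count, unsigned *cf)
{
    uint32_t wide = ((uint32_t)(*cf & 1) << 16) | v;   /* 17-bit value: CF:reg */
    count %= 17;
    wide = ((wide << count) | (wide >> (17 - count))) & 0x1FFFFu;
    *cf = (wide >> 16) & 1;
    return (uint16_t)wide;
}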

So probably 186's cl % 32 design was compatible enough, and achieved the desired HW simplification / upper limit on cycles spent shifting.

186 was apparently intended for embedded use and had some integrated devices at addresses that conflicted with the IBM-PC, so perhaps Intel felt they could experiment with this change in 186 to see if it caused problems. Since it didn't(?), they kept it for 286? This is a totally made-up guess based on a couple of random facts extracted from other people's comments. I wasn't using PCs until Linux on a P-MMX Pentium, and am only idly curious about this history, not a retrocomputing enthusiast. Speaking of which, https://retrocomputing.stackexchange.com/ might be a good place to ask about this 186 design decision.

Why didn't 386 widen the count mask for wider shifts?

Why not have 386 still able to shift out all the bits with shl eax, 32?

There was no existing software using 32-bit registers that 386 needed to be backwards compat with. 32-bit mode (and 32-bit operand size in 16-bit mode) was new with 386. So 386 could have chosen anything for 32-bit shifts. (But 8 and 16-bit shifts work exactly the same as in 186/286 to ensure compatibility.)

I don't know if Intel thought masked shift counts were actively useful as a feature or not. Masking to the same % 32 as 16-bit shifts was probably the easiest for them to implement, and is usable for 32-bit shifts.

386 had O(1) shifts with a barrel shifter, according to some random SO comments. Supporting larger shift counts would require a wider barrel shifter.

386 also introduced shld / shrd double-precision shifts that shift in bits from another register, instead of 0 or copies of the sign bit. It would have been neat to be able to shift all the bits out and use shld eax, edx, 37 as a copy-and-shift with a false dependency. But supporting counts >= 32 for shld/shrd would require a wider barrel shifter, not just a "zero the output if any high count bits are set" check. For each output bit, the current design has 32 possible sources for that bit; allowing wider counts would increase that to 64 possible sources per result bit. As @Brendan shows, you can do a multi-step process instead of building a 32:1 mux for each bit, but then you have more gate delays.
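
For reference, the effect of 32-bit shld with the real % 32 count masking can be modelled in C roughly like this (a sketch; the count == 0 special case only avoids an undefined shift in C, the hardware simply does nothing for a masked count of 0):

#include <stdint.h>

/* Rough model of shld r32, r32, count: shift dst left, filling from the top of src.
   Because the count is masked to 0..31, shifting in *all* of src is impossible. */
static inline uint32_t shld32(uint32_t dst, uint32_t src, unsigned count)
{
    count &= 31;
    if (count == 0)
        return dst;
    return (dst << count) | (src >> (32 - count));
}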

It would be inconsistent for SHLD / SHRD to treat their count differently from other shifts, and anything other than % 32 makes it harder to build.

I'm not sure this argument holds water: shld ax, dx, 25 would in theory do something, but Intel's current manual says "If a count is greater than the operand size, the result is undefined." (I didn't test actual HW to see what happens.) Intel could simply have said the same thing for 32-bit shld/shrd in 386 if wider counts were allowed for other shifts.


Random thought: Rotate-through-carry is slow and micro-coded on modern CPUs for counts != 1. IDK if that would be another complication or not.

0
votes

I don't think that shifting a 32-bit register by 32 is more difficult than shifting it by 31 bits. From a mathematical point of view it would be more appropriate to saturate the shift count instead of masking it. We have to remember that SHR EAX,32 does nothing, so another instruction has to be used to clear the contents of EAX.

Perhaps the Intel developers wanted to use the same internal mechanism for rotate and shift operations. For instance, ROR EAX,35 is equivalent to ROR EAX,3, and consequently SHR EAX,35 is equivalent to SHR EAX,3.
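
That equivalence is easy to check with a variable count in CL. A small GNU-C sketch (GCC/Clang on x86; not from the answer):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t a = 0xFF000000u, b = 0xFF000000u;
    unsigned char big = 35, small = 3;
    /* The CPU masks CL to 5 bits, so a shift count of 35 executes as 35 & 31 = 3. */
    __asm__("shrl %%cl, %0" : "+r"(a) : "c"(big)   : "cc");
    __asm__("shrl %%cl, %0" : "+r"(b) : "c"(small) : "cc");
    printf("shr by 35 -> 0x%08x, shr by 3 -> 0x%08x\n", a, b);  /* both 0x1fe00000 */
    return 0;
}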