I have some sample code from a shellcode payload that sets up a for-loop counter using push/pop:
push 9
pop ecx
Why can it not just use mov?
mov ecx, 9
Yes, normally you should always use mov ecx, 9 for performance reasons. It runs more efficiently than push/pop, as a single-uop instruction that can run on any integer ALU port. (This is true across all existing CPUs that Agner Fog has tested: https://agner.org/optimize/)
The normal reason for push imm8 / pop r32 is that the machine code is free of zero bytes. This is important for shellcode that has to overflow a buffer via strcpy or any other method that treats it as part of an implicit-length C string terminated by a 0 byte.
mov ecx, immediate is only available with a 32-bit immediate, so the machine code will look like B9 09 00 00 00, vs. 6A 09 (push 9) ; 59 (pop ecx).
(ECX is register number 1, which is where B9 and 59 come from: the low 3 bits of the opcode byte = 001.)
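As a rough illustration of the two encodings above, here is a small Python sketch that builds the bytes by hand and models what a C-string copy would keep. The helper names (mov_r32_imm32, push_imm8_pop_r32, strcpy_like) are made up for this example, and strcpy_like is only a simplified model of strcpy, not a real exploit harness.

```python
# Register numbers as they appear in x86 opcode and ModRM fields.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def mov_r32_imm32(reg, value):
    """B8+rd id: register number in the low 3 opcode bits, then a 4-byte little-endian immediate."""
    return bytes([0xB8 + REG32[reg]]) + (value & 0xFFFFFFFF).to_bytes(4, "little")

def push_imm8_pop_r32(reg, value):
    """6A ib (push sign-extended imm8) followed by 58+rd (pop r32)."""
    if not -128 <= value <= 127:
        raise ValueError("value does not fit in a signed 8-bit immediate")
    return bytes([0x6A, value & 0xFF, 0x58 + REG32[reg]])

def strcpy_like(payload):
    """Simplified model of a C-string copy: everything from the first 0 byte on is lost."""
    return payload.split(b"\x00", 1)[0]

mov_bytes = mov_r32_imm32("ecx", 9)
pp_bytes = push_imm8_pop_r32("ecx", 9)

print(mov_bytes.hex(" "))               # b9 09 00 00 00
print(pp_bytes.hex(" "))                # 6a 09 59
print(strcpy_like(mov_bytes).hex(" "))  # b9 09  (instruction truncated)
print(strcpy_like(pp_bytes).hex(" "))   # 6a 09 59  (survives intact)
```

The mov encoding is cut off after its second byte, while the push/pop pair passes through unchanged.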
The other use-case is purely code-size: mov r32, imm32 is 5 bytes (using the no-ModRM encoding that puts the register number in the low 3 bits of the opcode), versus 3 bytes for push imm8 / pop r32, because x86 unfortunately lacks a sign-extended imm8 opcode for mov (there's no mov r/m32, imm8). A sign-extended imm8 form does exist for nearly all ALU instructions that date back to 8086.
In 16-bit 8086, that encoding wouldn't have saved any space: the 3-byte short-form mov r16, imm16 would be just as good as a hypothetical mov r/m16, imm8 for almost everything, except moving an immediate to memory where the mov r/m16, imm16 form (with a ModRM byte) is needed.
Since 386's 32-bit mode didn't add new opcodes, just changed the default operand-size and immediate widths, this "missed optimization" in the ISA in 32-bit mode started with 386. With full-width immediates being 2 bytes longer, an add r32, imm32 is now longer than an add r/m32, imm8. See x86 assembly 16 bit vs 8 bit immediate operand encoding. But we don't have that option for mov because there's no MOV opcode that sign-extends (or zero-extends) its immediate.
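To make the size difference concrete, the following Python sketch hand-encodes add ecx, 9 both ways (opcode 83 /0 ib with a sign-extended imm8, and opcode 81 /0 id with a full imm32) and compares them with mov ecx, 9. The function names and the force_imm32 flag are invented for this illustration; a real assembler normally picks the short form on its own when the value fits.

```python
# Register numbers as they appear in x86 opcode and ModRM fields.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def add_r32_imm(reg, value, force_imm32=False):
    """Encode add r32, imm in 32-bit mode, register-direct operand."""
    # ModRM: mod=11 (register), reg field = /0 (selects ADD), rm = destination.
    modrm = 0xC0 | (0 << 3) | REG32[reg]
    if -128 <= value <= 127 and not force_imm32:
        return bytes([0x83, modrm, value & 0xFF])       # 83 /0 ib
    return bytes([0x81, modrm]) + (value & 0xFFFFFFFF).to_bytes(4, "little")  # 81 /0 id

def mov_r32_imm32(reg, value):
    """B8+rd id: the only immediate form of mov for a 32-bit register."""
    return bytes([0xB8 + REG32[reg]]) + (value & 0xFFFFFFFF).to_bytes(4, "little")

short_add = add_r32_imm("ecx", 9)
long_add = add_r32_imm("ecx", 9, force_imm32=True)
mov = mov_r32_imm32("ecx", 9)

print(short_add.hex(" "))   # 83 c1 09
print(long_add.hex(" "))    # 81 c1 09 00 00 00
print(mov.hex(" "))         # b9 09 00 00 00
print(len(short_add), len(long_add), len(mov))  # 3 6 5
```

add gets a 3-byte form for small constants, but mov has no equivalent, so it stays at 5 bytes however small the value is.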
Fun fact: clang -Oz (optimize for size even at the expense of speed) will compile int foo(){return 9;} to push 9 ; pop rax.
See also Tips for golfing in x86/x64 machine code on Codegolf.SE (a site about optimizing for size, usually for fun rather than to fit code into a small ROM or boot sector; for machine code, though, optimizing for size does sometimes have practical applications, even at the expense of performance).
If you already had another register with known contents, creating 9 in another register can be done with a 3-byte lea ecx, [eax-0 + 9] (if EAX holds 0; in general the displacement is 9 minus the known value, and it has to fit in a signed 8-bit displacement). That's just opcode + ModRM + disp8. So you can avoid the push/pop hack if you already were going to xor-zero any other register. lea is barely less efficient than mov, and you could consider it when optimizing for speed because smaller code-size has minor speed benefits in the large scale: L1i cache hits, and sometimes decode if the uop cache isn't already hot.
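The 3-byte lea encoding can be checked with a small Python sketch. The helper name lea_r32_base_disp8 and the known_eax variable are assumptions made for this example; it only models the simple base + disp8 form (no SIB byte, so ESP as a base is rejected).

```python
# Register numbers as they appear in x86 opcode and ModRM fields.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def lea_r32_base_disp8(dst, base, disp):
    """Encode lea dst, [base + disp8] as opcode 8D, ModRM, disp8."""
    if base == "esp":
        raise ValueError("ESP as a base needs a SIB byte, which this sketch does not model")
    if not -128 <= disp <= 127:
        raise ValueError("displacement does not fit in a signed 8-bit field")
    # ModRM: mod=01 (base + disp8), reg = destination, rm = base register.
    modrm = (0b01 << 6) | (REG32[dst] << 3) | REG32[base]
    return bytes([0x8D, modrm, disp & 0xFF])

known_eax = 0                       # assumed: EAX was already xor-zeroed
code = lea_r32_base_disp8("ecx", "eax", 9 - known_eax)

print(code.hex(" "))                # 8d 48 09
print(len(code), 0 in code)         # 3 False
```

If EAX held some other known small value, only the displacement byte would change, for example 04 if EAX were known to be 5.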
Essentially the same thing: push 9 puts 9 on the stack and pop ecx then loads it into the ECX register, which has basically the same effect as mov ecx, 9. Personally, I think moving 9 directly into ecx is probably more efficient than pushing 9 onto the stack and popping it into ecx, but I don't think the processing time is an issue here: given how small the code is, both are effectively equally fast.
mov ecx, 9 does have zeros in its encoding. I can see this being done for a few reasons: (a) the programmer was new to assembly and it was simply bad code, (b) it is a shorter encoding than the mov, (c) there is a label between the push and pop and the pop is at the top of a loop, (d) someone was trying to align the top of the loop on a 16-byte boundary, (e) someone was coding to avoid NUL bytes in the encoding (shell exploits). – Michael Petch