5
votes

I have some sample code from a shell code payload showing a for loop and using push/pop to set the counter:

push 9
pop ecx

Why can it not just use mov?

mov ecx, 9
3
Any chance you saw this in an exploit or shellcode? I ask because this technique has the advantage that it doesn't add NUL (0) bytes to the encoding, whereas mov ecx, 9 does have zeros in its encoding. I can see a few possible reasons for this: (a) the programmer was new to assembly and simply wrote bad code, (b) it is a shorter encoding than the mov, (c) there is a label between the push and pop and the pop is at the top of a loop, (d) someone was trying to align the top of the loop on a 16-byte boundary, (e) someone was coding to avoid NUL bytes in the encoding (shell exploits). - Michael Petch
(f) this was compiler-generated and could be the result of code generation with limited or no optimization (a missed optimization). - Michael Petch
This could also be shellcode, which needs to avoid NUL bytes. - fuz
Thanks, this was from a malicious code sample :) - Hawke
No problem. Had a hunch it was probably an exploit ;-). That was why I felt compelled to ask in my first comment. It is the one place where it actually makes the most sense. I'll update the tag and question. - Michael Petch

3 Answers

7
votes

Yes, normally you should use mov ecx, 9 for performance reasons. It runs more efficiently than push/pop, as a single-uop instruction that can run on any port. (This is true across all existing CPUs that Agner Fog has tested: https://agner.org/optimize/)


The normal reason for push imm8 / pop r32 is that the machine code is free of zero bytes. This is important for shellcode that has to overflow a buffer via strcpy or any other method that treats it as part of an implicit-length C string terminated by a 0 byte.
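To make the strcpy point concrete, here is a minimal Python sketch that mimics a C-string copy on the two encodings discussed in this answer. The helper name c_string_copy is made up for illustration, and it only models "stop at the first 0 byte", not a real overflow.

```python
def c_string_copy(payload: bytes) -> bytes:
    """Mimic strcpy: copy bytes up to, but not including, the first 0 byte.

    Hypothetical helper for illustration only.
    """
    end = payload.find(b"\x00")
    return payload if end == -1 else payload[:end]


MOV_ECX_9 = bytes.fromhex("B9 09 00 00 00")  # mov ecx, 9
PUSH_POP_9 = bytes.fromhex("6A 09 59")       # push 9 ; pop ecx

# The mov is cut off after its first immediate byte ...
print(c_string_copy(MOV_ECX_9).hex())   # b909
# ... while the push/pop pair survives the copy intact.
print(c_string_copy(PUSH_POP_9).hex())  # 6a0959
```

Anything that follows the truncated mov in the payload would be lost as well, which is why shellcode authors avoid such encodings.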

mov ecx, immediate is only available with a 32-bit immediate, so the machine code will look like B9 09 00 00 00, versus 6A 09 for push 9 followed by 59 for pop ecx.

(ECX is register number 1, which is where B9 and 59 come from: the low 3 bits of the opcode byte = 001)
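As a rough illustration of how the register number is folded into these short-form opcodes, the following Python sketch builds and decodes them by hand. The function names are hypothetical; the base opcodes B8 (mov r32, imm32) and 58 (pop r32) are the standard ones.

```python
# 32-bit register numbers as used in x86 encodings (index = register number).
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def short_form_register(opcode: int) -> str:
    """Return the register selected by the low 3 bits of a short-form opcode."""
    return REG32[opcode & 0b111]


def mov_r32_imm32(reg: str, value: int) -> bytes:
    """Encode mov r32, imm32 as (B8 + register number) plus a little-endian imm32."""
    return bytes([0xB8 + REG32.index(reg)]) + (value & 0xFFFFFFFF).to_bytes(4, "little")


def pop_r32(reg: str) -> bytes:
    """Encode pop r32 as the single byte (58 + register number)."""
    return bytes([0x58 + REG32.index(reg)])


print(mov_r32_imm32("ecx", 9).hex())  # b909000000
print(pop_r32("ecx").hex())           # 59
print(short_form_register(0xB9))      # ecx
```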


The other use-case is purely code-size: mov r32, imm32 is 5 bytes (using the no-ModRM encoding that puts the register number in the low 3 bits of the opcode), because x86 unfortunately lacks a sign-extended imm8 opcode for mov (there's no mov r/m32, imm8). Such a sign-extended imm8 form does exist for nearly all ALU instructions that date back to 8086.

In 16-bit 8086, that encoding wouldn't have saved any space: the 3-byte short-form mov r16, imm16 would be just as good as a hypothetical mov r/m16, imm8 for almost everything, except moving an immediate to memory where the mov r/m16, imm16 form (with a ModRM byte) is needed.

Since 386's 32-bit mode didn't add new opcodes, just changed the default operand-size and immediate widths, this "missed optimization" in the ISA in 32-bit mode started with 386. With full-width immediates being 2 bytes longer, an add r32,imm32 is now longer than an add r/m32, imm8. See x86 assembly 16 bit vs 8 bit immediate operand encoding. But we don't have that option for mov because there's no MOV opcode that sign-extends (or zero-extends) its immediate.
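To show the size difference that the sign-extended imm8 form makes for add (and that mov cannot benefit from), here is a small Python sketch that picks between the 83 /0 ib and 81 /0 id encodings by hand. The function name is made up, it only handles register destinations, and it ignores the separate short form that assemblers can use for add eax, imm32.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def add_r32_imm(reg: str, value: int) -> bytes:
    """Encode add r32, imm, choosing the sign-extended imm8 form when it fits."""
    # ModRM: mod=11 (register direct), reg field=/0 (selects ADD), rm=register number.
    modrm = 0xC0 | (0 << 3) | REG32.index(reg)
    if -128 <= value <= 127:
        # 83 /0 ib: one-byte immediate, sign-extended to 32 bits by the CPU.
        return bytes([0x83, modrm]) + value.to_bytes(1, "little", signed=True)
    # 81 /0 id: full 32-bit immediate.
    return bytes([0x81, modrm]) + (value & 0xFFFFFFFF).to_bytes(4, "little")


print(add_r32_imm("ecx", 9).hex())     # 83c109       (3 bytes)
print(add_r32_imm("ecx", 1000).hex())  # 81c1e8030000 (6 bytes)
print(add_r32_imm("ecx", -1).hex())    # 83c1ff       (3 bytes)
```

mov has no counterpart to the 83 form, so mov ecx, 9 stays at 5 bytes no matter how small the constant is.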

Fun fact: clang -Oz (optimize for size even at the expense of speed) will compile int foo(){return 9;} to push 9 ; pop rax.

See also Tips for golfing in x86/x64 machine code on Codegolf.SE (a site about optimizing for size usually for fun, rather than to fit code into a small ROM or boot sector. But for machine code, optimizing for size does have practical applications sometimes, even at the expense of performance.)

If you already have another register with known contents, creating 9 in a different register can be done with a 3-byte lea ecx, [eax-0 + 9] (if EAX holds 0; the -0 stands for subtracting whatever known value EAX holds). That is just opcode + ModRM + disp8. So you can avoid the push/pop hack if you were already going to xor-zero some other register. lea is barely less efficient than mov, and you could consider it even when optimizing for speed, because smaller code size has minor speed benefits at large scale: more L1i cache hits, and sometimes faster decode if the uop cache isn't already hot.
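The opcode + ModRM + disp8 layout of that lea can be sketched in Python as follows. The function name is hypothetical, and the sketch deliberately rejects ESP as a base register, since that would need an extra SIB byte.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def lea_r32_base_disp8(dst: str, base: str, disp: int) -> bytes:
    """Encode lea dst, [base + disp8] as opcode 8D + ModRM + disp8."""
    if base == "esp":
        raise ValueError("ESP as a base needs a SIB byte; not handled in this sketch")
    if not -128 <= disp <= 127:
        raise ValueError("displacement does not fit in a signed byte")
    # ModRM: mod=01 (base + disp8), reg=destination, rm=base register.
    modrm = (0b01 << 6) | (REG32.index(dst) << 3) | REG32.index(base)
    return bytes([0x8D, modrm]) + disp.to_bytes(1, "little", signed=True)


# With EAX known to be 0, this leaves 9 in ECX.
print(lea_r32_base_disp8("ecx", "eax", 9).hex())  # 8d4809
```

Like the push/pop pair, this particular encoding contains no zero bytes, though the register it depends on still has to be zeroed somewhere earlier.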

2
votes

There may be several reasons for this.

In this case this seems to be done because the code is smaller:

The variant with the push and the pop combination is 3 bytes long, the mov instruction is 5 bytes long.

However, I would guess that the mov variant is faster ...

0
votes

Essentially the same thing. push 9 puts 9 on the stack and pop ecx then pops it into the ecx register, which has basically the same effect as mov ecx, 9. Personally I think moving 9 into ecx directly is probably more efficient than pushing 9 onto the stack and then popping it into ecx, but I don't think the processing time is an issue here, so both are about equally fast considering how small the code is either way.