5
votes

In this disassembly from VC++ a function call is being made. The compiler MOVs the local pointers to a register before pushing them:

    memcpy( nodeNewLocation, pNode, sizeCurrentNode );
0041A5DA 8B 45 F8             mov         eax,dword ptr [ebp-8]  
0041A5DD 50                   push        eax  
0041A5DE 8B 4D 0C             mov         ecx,dword ptr [ebp+0Ch]  
0041A5E1 51                   push        ecx  
0041A5E2 8B 55 D4             mov         edx,dword ptr [ebp-2Ch]  
0041A5E5 52                   push        edx  
0041A5E6 E8 67 92 FF FF       call        00413852  
0041A5EB 83 C4 0C             add         esp,0Ch 

Why not just push them directly? For example:

push  dword ptr [ebp-8]

Also, if you are going to do a separate push, why not do it manually? In other words, instead of doing "push eax" above, do

mov [esp], eax

and so on. The advantage of this is that after doing the three MOVs you can do a single subtract to set the new stack pointer, instead of implicitly subtracting three times with the pushes.

UPDATE: Release version

This is the same code compiled for release:

; 741  :    memcpy( nodeNewLocation, pNode, sizeCurrentNode );

  00087 8b 45 f8     mov     eax, DWORD PTR _sizeCurrentNode$[ebp]
  0008a 8b 7b 04     mov     edi, DWORD PTR [ebx+4]
  0008d 50       push    eax
  0008e 56       push    esi
  0008f 57       push    edi
  00090 e8 00 00 00 00   call    _memcpy
  00095 83 c4 0c     add     esp, 12            ; 0000000cH

Definitely more efficient than the debug version, but it is still doing a MOV/PUSH combo.

Is that actually compiled in release mode? It looks vaguely debuggish. - harold
It is compiled for debug. Why would that make a difference in this case? - Tyler Durden
Because the compiler is not going to care about such things in debug mode. - harold
In your final example, is it safe to leave the stack temporarily unbalanced by deferring the sub? I know that would be bad news in real mode (interrupt "borrows" part of your stack at an inopportune time), but I am not certain in protected mode. - Brian Knoblauch
By decoupling the instructions, you reduce the number of register stalls. - Raymond Chen

3 Answers

5
votes

This is an optimization. It is explicitly mentioned in the Intel processor manuals, volume 4, section 12.3.3.6:

In Intel Atom microarchitecture, using PUSH/POP instructions to manage stack space and address adjustment between function calls/returns will be more optimal than using ENTER/LEAVE alternatives. This is because PUSH/POP will not need MSROM flows and stack pointer address update is done at AGU. When a callee function need to return to the caller, the callee could issue POP instruction to restore data and restore the stack pointer from the EBP.

Assembly/Compiler Coding Rule 19. (MH impact, M generality) For Intel Atom processors, favor register form of PUSH/POP and avoid using LEAVE; Use LEA to adjust ESP instead of ADD/SUB.

The rest of the manual isn't that clear about the reason, but it does mention a possible 3 cycle AGU stall on implicit ESP adjustments.

1
votes

I suspect it only does it in debug builds, or possibly in some situations where it's warranted by pipelining or other considerations (e.g. it could put a parameter into esi and use it after the call). I've looked into some binaries, and MSVC definitely does use such pushes:

 push ebx          ; mthd
 push dword ptr [ebp+place+4]
 push dword ptr [ebp+place] ; pos
 push [ebp+filedes]   ; fh
 call __lseeki64_nolock

(code from the CRT)

As for the second question, instructions addressing esp are longer than pushes: "push eax" is one byte while "mov [esp-8], eax" is four bytes. In fact, this approach (mov instead of push) has been used by GCC by default for a couple of versions now (option -maccumulate-outgoing-args), and it has led to notable increases in code size. Supposedly it makes code faster, but I'm unconvinced.

1
votes

I actually figured out the reason for it. It has to do with the way instructions are pipelined on the Pentium MMX. There are two pipelines, U and V, which allow MMX processors to process two instructions at a time IF they are pairable. PUSHes are not pairable with one another, but they are pairable with MOVs. So, if you write:

mov eax, [indirect]
mov esi, [indirect]
push eax
push esi

then what happens is that instructions #1 and #3 get paired and #2 and #4 get paired, so effectively these four instructions run in the same number of cycles as a single mov/push, and a single mov/push is faster than two push [indirect]s. This exact case is described in detail in Section 4.3, p. 41, Examples 4.11a and 4.11b, of the microarchitecture optimization guide by Agner Fog, which is widely available on the internet.