56
votes

I write empty programs to annoy the hell out of stackoverflow coders, NOT. I am just exploring the gnu toolchain.

Now the following might be too deep for me, but to continue the empty program saga I have started to examine the output of the C compiler, the stuff GNU as consumes.

gcc version 4.4.0 (TDM-1 mingw32)

test.c:

int main()
{
    return 0;
}

gcc -S test.c

    .file   "test.c"
    .def    ___main;    .scl    2;  .type   32; .endef
    .text
.globl _main
    .def    _main;  .scl    2;  .type   32; .endef
_main:
    pushl   %ebp
    movl    %esp, %ebp
    andl    $-16, %esp
    call    ___main
    movl    $0, %eax
    leave
    ret 

Can you explain what happens here? Here is my effort to understand it. I have used the as manual and my minimal x86 ASM knowledge:

  • .file "test.c" is the directive for the logical filename.
  • .def: according to the docs "Begin defining debugging information for a symbol name". What is a symbol (a function name/variable?) and what kind of debugging information?
  • .scl: docs say "Storage class may flag whether a symbol is static or external". Is this the same static and external I know from C? And what is that '2'?
  • .type: stores the parameter "as the type attribute of a symbol table entry", I have no clue.
  • .endef: no problem.
  • .text: Now this is problematic. It seems to be something called a section, and I have read that it's the place for code, but the docs didn't tell me too much.
  • .globl "makes the symbol visible to ld.", the manual is quite clear on this.
  • _main: This might be the starting address (?) for my main function
  • pushl: A long (32-bit) push, which places EBP on the stack
  • movl: 32-bit move. Pseudo-C: EBP = ESP;
  • andl: Logical AND. Pseudo-C: ESP = -16 & ESP; I don't really see what the point of this is.
  • call: Pushes the IP to the stack (so the called procedure can find its way back) and continues where __main is. (what is __main?)
  • movl: this zero must be the constant I return at the end of my code. The MOV places this zero into EAX.
  • leave: restores stack after an ENTER instruction (?). Why?
  • ret: goes back to the instruction address that is saved on the stack

Thank you for your help!

5
Good question. :) -- Johannes Schaub - litb
Sounds like an excellent exercise for a true geek. -- JesperE
I found the COFF specification. This should give some references to what "32" in ".type" means etc: microsoft.com/whdc/system/platform/firmware/PECOFFdwn.mspx -- Johannes Schaub - litb

5 Answers

56
votes

.file "test.c"

Commands starting with . are directives to the assembler. This one just says that the source file is "test.c"; that information can be exported to the debugging information of the exe.

.def ___main; .scl 2; .type 32; .endef

The .def directive defines a debugging symbol. .scl 2 means storage class 2 (the external storage class), and .type 32 says this symbol is a function. These numbers are defined by the PE/COFF executable format.

___main is a function that takes care of the bootstrapping that gcc needs (it'll do things like run C++ static initializers and other housekeeping).

.text

Begins a text section - code lives here.

.globl _main

Defines the _main symbol as global, which makes it visible to the linker and to other modules that are linked in.

.def        _main;  .scl    2;      .type   32;     .endef

Same thing as for ___main: it creates debugging symbols stating that _main is a function. This can be used by debuggers.

_main:

Starts a new label (it'll end up as an address). The .globl directive above makes this address visible to other entities.

pushl       %ebp

Saves the old frame pointer (the ebp register) on the stack, so it can be put back in place when this function ends.

movl        %esp, %ebp

Copies the stack pointer into the ebp register. ebp is often called the frame pointer; it points at the top of the stack values within the current "frame" (usually a function). Referring to variables on the stack via ebp can help debuggers.

andl $-16, %esp

ANDs the stack pointer with 0xfffffff0, which effectively aligns it on a 16-byte boundary. Access to aligned values on the stack is much faster than if they were unaligned. All these preceding instructions are pretty much a standard function prologue.

call        ___main

Calls the ___main function, which will do the initialization that gcc needs. call pushes the return address (the address of the instruction after the call) on the stack and jumps to the address of ___main.

movl        $0, %eax

Moves 0 into the eax register (the 0 in return 0;). The eax register holds integer function return values in the cdecl calling convention, which is what gcc uses by default here (stdcall returns values in eax as well).

leave

The leave instruction is pretty much shorthand for

movl     %ebp, %esp
popl     %ebp

i.e. it "undoes" the stuff done at the start of the function, restoring the frame pointer and stack pointer to their former state.

ret

Returns to whoever called this function. It'll pop the instruction pointer from the stack (which a corresponding call instruction will have placed there) and jump there.

12
votes

There's a very similar exercise outlined here: http://en.wikibooks.org/wiki/X86_Assembly/GAS_Syntax

You've figured out most of it -- I'll just make additional notes for emphasis and additions.

__main is a subroutine in the GNU standard library that takes care of various start-up initialization. It is not strictly necessary for C programs but is required just in case the C code is linking with C++.

_main is your main subroutine. As both _main and __main are code locations they have the same storage class and type. I've not yet dug up the definitions for .scl and .type. You may get some illumination by defining a few global variables.

The first three instructions set up a stack frame, which is a technical term for the working storage of a subroutine: local and temporary variables for the most part. Pushing ebp saves the base of the caller's stack frame. Putting esp into ebp sets the base of our stack frame. The andl aligns the stack frame to a 16-byte boundary just in case any local variables on the stack require 16-byte alignment (the x86 SIMD instructions require that alignment, but alignment also speeds up ordinary types such as ints and floats).

At this point you'd normally expect esp to get moved down in memory to allocate stack space for local variables. Your main has none so gcc doesn't bother.

The call to __main is special to the main entry point and won't typically appear in subroutines.

The rest goes as you surmised. Register eax is the place to put integer return codes in the binary spec. leave undoes the stack frame and ret goes back to the caller. In this case, the caller is the low-level C runtime, which will do additional magic (like calling atexit() functions, setting the exit code for the process and asking the operating system to terminate the process).

5
votes

Regarding that andl $-16,%esp

  • 32 bits: -16 in decimal equals 0xfffffff0 in hexadecimal representation
  • 64 bits: -16 in decimal equals 0xfffffffffffffff0 in hexadecimal representation

So it will mask off the last 4 bits of ESP (btw: 2**4 equals 16) and will retain all other bits (no matter if the target system is 32 or 64 bits).

4
votes

Further to the andl $-16,%esp, this works because setting the low bits to zero will always adjust %esp down in value, and the stack grows downward on x86.

2
votes

I don't have all answers but I can explain what I know.

ebp is used by the function to hold the initial value of esp for the duration of its body; it serves as a reference point for finding the arguments passed to the function and its own local variables. The first thing a function does is save the ebp it was given with pushl %ebp (that value is vital to the calling function), and then replace it with its own current stack position esp with movl %esp, %ebp. Zeroing the last 4 bits of esp at this point is GCC specific; I don't know why this compiler does that. It would work without doing it. Now finally we get down to business: call ___main. What is __main? I don't know either... maybe more GCC-specific procedures. Finally comes the only thing your main() does: it sets the return value to 0 with movl $0, %eax, then leave, which is the same as doing movl %ebp, %esp; popl %ebp to restore the state of ebp, and then ret to finish. ret pops eip and continues the thread's flow from that point, wherever it is (as this is main(), this ret probably leads to some runtime procedure which handles the end of the program).

Most of it is about managing the stack. I wrote a detailed tutorial about how the stack is used some time ago; it would be useful for explaining why all those things are done. But it's in Portuguese...