Understanding Assembly Hello World

Question

I have this Hello World example that is part of a course I'm using for learning assembly:

push ebp
mov ebp, esp
push offset aHelloWorld; "Hello world\n"
call ds:__imp__printf
add esp, 4
mov eax, 1234h
pop ebp
retn

This code was generated by Windows Visual C++ 2005 with buffer overflow protection turned off and disassembled with IDA Pro 4.9 Free Version.

I'm trying to understand what each line does.

the first line is push ebp.

I know ebp stands for base pointer. What is its function?

I see that in the second line the value in esp is moved into ebp, and searching online I see that these first 2 instructions are very common at the beginning of an assembly program.

Though are ebp and esp empty at the beginning? I'm new to assembly. Is ebp used for stack frames, so when we have a function in our code and is it optional for a simple program?

Then push offset aHelloWorld; "Hello world\n"

The part after ; is a comment so it doesn't get executed right? The first part instead adds the address containing the string Hello World to the stack, right? But where is the string declared? I'm not sure I understand.

Then call ds:__imp__printf

it seems it's a call to a function, anyway printf is a builtin function right? And does ds stand for data segment register? Is it used because we are trying to access a memory operand that isn't on the stack?

then add esp, 4

do we add 4 bytes to esp? Why?

then mov eax, 1234h what is 1234h here?

then pop ebx..it was pushed at the beginning. is it necessary to pop it at the end?

then retn ( i knew about ret for returning a value after calling a function). I read that the n in retn refers to the number of pushed arguments by the caller. It isn't very clear for me. Can you help me to understand?

I wonder why do you want to develop exploits? The current SW is already quite faulty as is, and working many times just by pure accident. Nowadays it's often more effort to not bring it down by regular usage, than to crash it by exploit. Too scared to do some real coding, so you want to stay with the easy side? — Ped7g
I work in information security and I want to learn exploit development as well. Simple. Anyway this is just a hello world example that does no harm. Can you help me to understand the code well? — Fabio
I'm trying, but you have so many questions, and you are so green to this, that it would probably make more sense to return to some tutorials and documents, also some computer architecture books, etc... I'm writing down some partial answer, but don't expect much. — Ped7g
BTW, this "simple no-harm hello world" looks like it is loading printf from a dynamic library. So if you somehow (by some attack) injected a malicious dll during execution of this, with a patched malicious printf version, it may do a lot of harm (at least in the context under which you run the hello world, unless that malicious code uses some other bug to escalate its privileges and escape the current context, etc...). ... so much for the "no harm" ... :D — Ped7g
there are even different syntaxes, different assemblers, differences between an operating system and another..differences from an architecture and another (although now I'm focusing on x86), so getting lost is easy and it requires many prerequisites.. — Fabio

Cody Gray · Accepted Answer · 2016-09-08T10:40:30

I'm trying to understand what each line does.

That would fall under the general category of learning assembly language. There are entire books written about this topic; some of them are probably even pretty good. You should purchase one. To ensure that you get maximum bang for your buck, be sure to select one that focuses on the architecture and operating system you're interested in. x86 assembly language is, of course, always the same, but the programming model differs enough between Windows and Linux that the differences would be confusing to a beginner.

If you're too cheap to buy a book, at least read Matt Pietrek's classic series of articles, "Just Enough Assembly To Get By", from the Microsoft Systems Journal. Start here, and proceed to the follow-up.

The first line is push ebp. I know ebp stands for base pointer. What is its function?

I see that in the second line the value in esp is moved into ebp, and searching online I see that these first 2 instructions are very common at the beginning of an assembly program.

I'm new to assembly. Is ebp used for stack frames, so when we have a function in our code and is it optional for a simple program?

To understand this first line in isolation, you just need to know what a PUSH instruction does. It pushes the operand (in this case, a register) onto the top of the stack. EBP is the register that almost always contains the stack base pointer.

That doesn't tell you much about the purpose of this code, though. This line and the next one are part of the standard function prologue. Matt talks about that near the beginning of his very first article, in the "Procedure Entry and Exit" section. First, the stack base pointer from EBP is saved by PUSHing it onto the stack. Then, the second instruction copies the value of ESP into the EBP register. This makes interacting with the stack throughout the function easier. Generally, the prologue section would end with an instruction that reserved an arbitrary amount of space on the stack for temporary variables (e.g., sub esp, 8 to reserve 8 bytes on the stack). This function doesn't need any.

Yes, this prologue code is optional. If you don't need any stack space and/or you use ESP-relative addressing, then you don't need the standard prologue. Optimizing compilers often omit it when possible.

Though are ebp and esp empty at the beginning?

No, of course they are not empty. If they were empty, the code wouldn't bother to save the value of EBP or use the value of ESP.

In fact, no registers are empty at the beginning of a function. Each one falls into one of three groups. Some contain the values that the function's prototype (in conjunction with its calling convention) says they do. Some contain values that you must preserve: they must still have the same values when your function returns control that they did when your function was first called. These are called callee-save registers, and which ones they are differs depending on the calling convention. The rest contain what you can assume to be garbage values: these are the caller-save registers, and you are free to clobber them in the callee function's code.

Then push offset aHelloWorld; "Hello world\n"

The part after ; is a comment so it doesn't get executed right? The first part instead adds the address containing the string Hello World to the stack, right? But where is the string declared? I'm not sure I understand.

aHelloWorld is a piece of global data declared in the executable image. It was put there at link time, probably because the original code used a string literal. This instruction PUSHes the offset of that global data (that is, its address) onto the stack.

Yes, the part after the semicolon is a comment. The disassembler is adding this comment as a favor to you. It has looked up the value of aHelloWorld, determined that it contains the string Hello world\n, and placed that definition in-line, saving you from having to look up the data's value yourself.

Then call ds:__imp__printf

it seems it's a call to a function, anyway printf is a builtin function right?

Yes, CALL always calls a function. In this case, it is calling the printf function. Is it a "built-in" function? That depends on your definition. From the perspective of assembly language, no: no function is built-in. printf is a function provided by the C standard library. When the original code was compiled and linked, it was also linked with the C run-time library, which provides the C standard library functions, including printf. Since this is MSVC, the __imp__ prefix is a big hint that the function being called is part of either the standard library or the Windows API. These are implicitly linked functions.

Looking up the printf function shows that it takes a variable number of arguments. In the most common x86-32 calling conventions, these arguments are passed on the stack. So that explains why the previous instruction PUSHed the address of string data onto the stack: it's passing that address to the printf function so that string can be printed to the standard output. It could have passed additional arguments to printf, but it didn't, because it didn't need to: it just needed one to print a literal string.
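As a rough illustration of what "takes a variable number of arguments" means on the C side, here is a minimal sketch of a variadic function. The name sum_ints and its count parameter are hypothetical, invented for this example; printf itself is declared as int printf(const char *format, ...) and uses the format string, rather than a count, to decide how many arguments to read. In the common 32-bit x86 conventions, the va_arg calls below end up walking the arguments that the caller PUSHed.

```c
#include <stdarg.h>

/* Hypothetical variadic function: adds up `count` int arguments.
   Like printf, it cannot know how many arguments were pushed unless
   one of the fixed arguments tells it. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;

    va_start(args, count);            /* start just past the last fixed argument */
    for (int i = 0; i < count; i++) {
        total += va_arg(args, int);   /* fetch the next int argument */
    }
    va_end(args);

    return total;
}
```

Calling sum_ints(3, 1, 2, 3) passes four values in total, and only the caller knows that for certain, which is why variadic functions leave stack cleanup to the caller.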

And does ds stand for data segment register? Is it used because we are trying to access a memory operand that isn't on the stack?

Yes, DS is the data segment. Your disassembler is just being verbose here. In Windows, x86-32 uses a flat memory model, so you can basically ignore the segment registers entirely and still understand everything that is going on perfectly well.

then add esp, 4

do we add 4 bytes to esp? Why?

Yes, this adds 4 bytes to the ESP register. Why? To clean up the stack. Recall that before CALLing the printf function, you PUSHed a 4-byte value (the offset of the string data in the executable image) on the stack. The printf function is variadic (takes a variable number of arguments), so the caller is always responsible for cleaning up the stack after calling it.

Here, you can think of adding 4 to ESP as equivalent to popping the stack with a POP instruction, except that the popped value is discarded rather than stored in a register. On x86, the stack always grows downwards, so adding is equivalent to popping (and the inverse of pushing).

then mov eax, 1234h what is 1234h here?

This instruction MOVes the constant value 0x1234 (the h means hexadecimal) into the EAX register.

Why? Well, I can guess. In all of the x86 calling conventions, the EAX register contains a function's return value. So it is very likely that the function's original code ended with return 0x1234;.
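To make that guess concrete, the original source plausibly looked something like the sketch below. This is a reconstruction, not the course's actual code: the function name hello_main is hypothetical (the real function was presumably main), and only the string and the return value are taken from the disassembly.

```c
#include <stdio.h>

/* Hypothetical reconstruction of the C source behind the listing. */
int hello_main(void)
{
    printf("Hello world\n");   /* push offset aHelloWorld / call printf / add esp, 4 */
    return 0x1234;             /* mov eax, 1234h */
}
```

The prologue and epilogue (push ebp, mov ebp, esp, pop ebp, retn) have no counterpart in the C source; the compiler adds them around the function body.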

then pop ebx..it was pushed at the beginning. is it necessary to pop it at the end?

Actually, it pops EBP, which is what was actually pushed at the beginning of the function.

And yes. Everything that you PUSH onto the stack has to be POPed off the stack. (Or equivalent, as we saw earlier with ADDing to ESP.) You have to clean up the stack. This is the function epilogue that corresponds to the prologue that we saw at the beginning. Refer back to Matt's article, where it talks about "Procedure Entry and Exit".
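One way to convince yourself that the stack really is balanced is to replay the listing against a toy model. The sketch below is only an illustration in C: the ToyCpu type, the helper names, and the addresses 0x1000, 0x2000 and 0xCAFE are all made up, and printf is reduced to the single effect that matters here (its ret pops the return address).

```c
#include <stdint.h>

/* Toy model of a 32-bit stack: 4-byte slots plus ESP and EBP.
   All names and addresses are illustrative, not from the real program. */
typedef struct {
    uint32_t mem[16];   /* 64 bytes of "stack memory" */
    uint32_t esp;       /* byte offset of the current top of the stack */
    uint32_t ebp;
} ToyCpu;

static void toy_push(ToyCpu *cpu, uint32_t value)
{
    cpu->esp -= 4;                    /* the stack grows downwards */
    cpu->mem[cpu->esp / 4] = value;
}

static uint32_t toy_pop(ToyCpu *cpu)
{
    uint32_t value = cpu->mem[cpu->esp / 4];
    cpu->esp += 4;
    return value;
}

/* Replays the stack effects of the listing. Returns 1 if ESP and EBP
   end up exactly where they started, 0 otherwise. */
int toy_run_listing(void)
{
    ToyCpu cpu = { {0}, 64, 0xCAFE };  /* empty stack; caller's EBP is 0xCAFE */
    uint32_t esp_on_entry = cpu.esp;
    uint32_t ebp_on_entry = cpu.ebp;

    toy_push(&cpu, cpu.ebp);   /* push ebp */
    cpu.ebp = cpu.esp;         /* mov ebp, esp */
    toy_push(&cpu, 0x1000);    /* push offset aHelloWorld (made-up address) */
    toy_push(&cpu, 0x2000);    /* call: pushes a return address (made-up) */
    (void)toy_pop(&cpu);       /* printf's ret pops that return address */
    cpu.esp += 4;              /* add esp, 4: discard the argument */
    cpu.ebp = toy_pop(&cpu);   /* pop ebp */

    return cpu.esp == esp_on_entry && cpu.ebp == ebp_on_entry;
}
```

If you delete the add esp, 4 step from the model, the final pop reads the string address into EBP instead of the saved value, which is exactly the kind of corruption the cleanup prevents.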

then retn ( i knew about ret for returning a value after calling a function). I read that the n in retn refers to the number of pushed arguments by the caller.

This is just an idiosyncrasy of your disassembler again. IDA Pro uses the retn mnemonic. This actually means a near return, but since x86-32 uses a flat (non-segmented) memory model, the near vs. far distinction is not relevant. You can think of retn as simply being equivalent to ret.

Note that this is distinct from the ret instruction that takes an argument, which is what you're thinking of. It doesn't "return" its argument, though. The function returns its result in the EAX register. Rather, ret n (where n is a 16-bit immediate value) returns and then pops the specified number of bytes off the stack. This is used only for certain calling conventions (most commonly __stdcall) where the callee is responsible for cleaning up the stack.
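The difference between the two cleanup styles comes down to simple arithmetic on ESP. The sketch below models it in C with ESP as a plain number; the function names are hypothetical and nothing here touches a real stack. It assumes a single 4-byte argument, as in the printf call above.

```c
#include <stdint.h>

/* Plain `ret`: pops the 4-byte return address. */
uint32_t model_ret(uint32_t esp)
{
    return esp + 4;
}

/* `ret n`: pops the return address, then n more bytes of arguments. */
uint32_t model_ret_n(uint32_t esp, uint16_t n)
{
    return esp + 4 + n;
}

/* Caller cleans up (__cdecl style), as in the listing. */
uint32_t model_cdecl_call(uint32_t esp)
{
    esp -= 4;               /* caller: push argument */
    esp -= 4;               /* call pushes the return address */
    esp = model_ret(esp);   /* callee: ret */
    esp += 4;               /* caller: add esp, 4 */
    return esp;
}

/* Callee cleans up (__stdcall style). */
uint32_t model_stdcall_call(uint32_t esp)
{
    esp -= 4;                    /* caller: push argument */
    esp -= 4;                    /* call pushes the return address */
    esp = model_ret_n(esp, 4);   /* callee: ret 4 */
    return esp;                  /* no add esp, 4 needed in the caller */
}
```

Both sequences leave ESP where it started. The only question is which side does the adding, and for a variadic function like printf it has to be the caller.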

See links in the x86 tag wiki and Wikipedia for more information on calling conventions.

It isn't very clear for me. Can you help me to understand?

Did I mention you should get a book that teaches assembly language programming?
