8
votes

I have an assembly hello world program for Mac OS X that looks like this:

global _main


section .text

_main:
    mov rax, 0x2000004
    mov rdi, 1
    lea rsi, [rel msg]
    mov rdx, msg.len
    syscall

    mov rax, 0x2000001
    mov rdi, 0
    syscall


section .data

msg:    db  "Hello, World!", 10
.len:   equ $ - msg

I was wondering about the line lea rsi, [rel msg]. Why does NASM force me to do that? As I understand it, msg is just a pointer to some data in the executable, and doing mov rsi, msg would put that address into rsi. But if I replace the line lea rsi, [rel msg] with mov rsi, msg, NASM throws this error (note: I am using the command nasm -f macho64 hello.asm):

hello.asm:9: fatal: No section for index 2 offset 0 found

Why does this happen? What is so special about lea that mov can't do? How would I know when to use each one?

2
I think Jester already answered this question. The Mach-O object file format requires everything to be position independent. This means your code needs to be able to be loaded at any address and still work. The mov rsi, msg uses an absolute address that would have to change depending on where the program is loaded, and Mach-O doesn't support that. – Ross Ridge
@RossRidge But aren't 'absolute addresses' actually relative to the beginning of the executable? – Jerfov2
The CPU doesn't know where the executable starts. When it executes the mov rsi, msg instruction it loads the register with the value encoded as an immediate operand. That immediate value needs to be the actual address of msg. Mach-O doesn't support that. – Ross Ridge
@RossRidge Does the executable know where it's going to be loaded? If not, how would it know where the address of msg will be? – Jerfov2
With Mach-O the executable doesn't know where it will be loaded. It doesn't know where msg will be located. By using RIP-relative addressing it doesn't need to. – Ross Ridge

2 Answers

10
votes

What is so special about lea that mov can't do?

mov reg,imm loads an immediate constant into its destination operand. The immediate constant is encoded directly in the instruction bytes, e.g. mov eax,someVar would be encoded as B8 EF CD AB 00 if the address of someVar is 0x00ABCDEF. That is, to encode such an instruction with imm being the address of msg, you need to know the exact address of msg. In position-independent code you don't know it a priori.

mov reg,[expression] loads the value located at the address described by expression. The x86 instruction encoding allows quite complex expressions: in general reg1+reg2*s+displ, where s can be 1, 2, 4, or 8, reg1 and reg2 are general-purpose registers (either may be omitted), and displ is an immediate displacement. In 64-bit mode the expression can take one more form: RIP+displ, i.e. the address is calculated relative to the next instruction.
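To make these addressing forms concrete, here is a small NASM sketch for 64-bit mode. The label table, the label example, and the register choices are illustrative assumptions, not part of the question's program.

```nasm
; Illustrative only: `table`, `example`, and the registers are assumptions.
bits 64

section .data
table:  dq 10, 20, 30, 40

section .text
example:
    mov rax, [rbx]                ; base register only
    mov rax, [rbx + rcx*8]        ; base + index*scale
    mov rax, [rbx + rcx*8 + 16]   ; base + index*scale + displacement
    mov rax, [rel table]          ; RIP + displacement (64-bit mode only)
    ret
```

Each of these loads the 8 bytes found at the computed address. Only the last form avoids needing any absolute address in the instruction.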

lea reg,[expression] uses the same address calculation, but loads the computed address itself into reg (unlike mov, which dereferences the calculated address). Thus information that is unavailable at assembly time, namely the absolute address that will be in RIP at run time, can be used by the instruction without knowing its value. The NASM expression lea rsi,[rel msg] gets translated into something like

    lea rsi,[rip+(msg-nextInsn)]
nextInsn:

which uses the relative address msg-nextInsn instead of absolute address of msg, thus allowing the assembler to not know the actual address but still encode the instruction.

9
votes

What is so special about lea that mov can't do?

LEA r, [rel symbol] can access RIP at run-time. mov r, imm can't. The immediate constant is encoded into the binary representation of the instruction, which means it won't work if the code+data are mapped to an address that isn't known at link time. (i.e. it's position-dependent code.)

This is why RIP-relative addressing is so nice for PIC (position-independent code): instead of needing a level of indirection through the Global Offset Table to access even static data defined in the same object file, you can just use RIP-relative addresses.

It also efficiently gives you a 64-bit address without needing a full 64-bit absolute embedded in the instruction. MacOS X requires 64-bit addresses because it maps the "image base" outside the low 4GiB of virtual address space.

It's a good thing if executables (not just shared libraries) are PIC, so MacOS can randomize their base address for more security. (Without having to rewrite absolute addresses anywhere they appear.)


In position-dependent Linux executables (not MacOS), you can as an optimization use mov esi, msg. Note ESI, not RSI. mov rsi, msg would be less efficient, using a 10-byte mov rsi, imm64 instead of a 7-byte lea rsi, [RIP + rel32]. (How to load address of function or label into register)
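As a rough sketch of that Linux-only optimization, here is a version of the question's program for a position-dependent (non-PIE) Linux x86-64 executable. The file name hello_linux.asm and the build commands in the comments are assumptions. The syscall numbers are the Linux x86-64 ones, not the MacOS ones from the question.

```nasm
; hello_linux.asm: sketch for a non-PIE Linux x86-64 executable.
; Assumed build:  nasm -f elf64 hello_linux.asm && ld -o hello_linux hello_linux.o
global _start

section .text
_start:
    mov eax, 1            ; Linux x86-64 syscall number for write
    mov edi, 1            ; fd 1 = stdout
    mov esi, msg          ; 5-byte mov r32, imm32; writing ESI zero-extends into RSI
    mov edx, msg.len      ; byte count
    syscall

    mov eax, 60           ; Linux x86-64 syscall number for exit
    xor edi, edi          ; exit status 0
    syscall

section .data
msg:    db  "Hello, World!", 10
.len:   equ $ - msg
```

This only works because a statically linked non-PIE executable is placed in the low 2GiB of address space, so the address of msg fits in a 32-bit immediate. If the same object file is linked as a PIE, the linker is expected to reject the 32-bit absolute relocation.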

The "normal" way to access static data in x86-64 is with RIP-relative addressing, e.g. mov eax, [rel my_global_var]. It's only for putting the address into a register that you might sometimes take advantage of 32-bit absolute, if the target allows 32-bit absolute.
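If writing rel on every memory operand gets tedious, NASM's default rel directive makes RIP-relative addressing the default for [symbol] operands. A minimal sketch, where my_global_var is the placeholder name from the paragraph above and load_it is a made-up label:

```nasm
; Sketch: `my_global_var` and `load_it` are hypothetical names.
default rel                     ; [symbol] operands become RIP-relative by default

section .data
my_global_var:  dd 42

section .text
load_it:
    mov eax, [my_global_var]    ; load the value; RIP-relative because of `default rel`
    lea rdi, [my_global_var]    ; load the address; also RIP-relative
    ret
```

With this directive in effect, the question's lea rsi, [rel msg] could be written as lea rsi, [msg].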

Other related Q&As: