Your distro configured gcc with --enable-default-pie, so it's making position-independent executables by default (allowing ASLR of the executable as well as of libraries). Most distros do that these days.
You actually are making a shared object: PIE executables are sort of a hack using a shared object with an entry-point. The dynamic linker already supported this, and ASLR is nice for security, so this was the easiest way to implement ASLR for executables.
32-bit absolute relocations aren't allowed in an ELF shared object; that would stop them from being loaded outside the low 2GiB (for sign-extended 32-bit addresses). 64-bit absolute addresses are allowed, but generally you only want that for jump tables or other static data, not as part of instructions (see footnote 1).
The "recompile with -fPIC" part of the error message is bogus for hand-written asm; it's written for the case of people compiling with gcc -c and then trying to link with gcc -shared -o foo.so *.o, with a gcc where -fPIE is not the default. The error message should probably change, because many people run into this error when linking hand-written asm.
How to use RIP-relative addressing: basics
Always use RIP-relative addressing for simple cases where there's no downside. See also footnote 1 below and this answer for syntax. Only consider using 32-bit absolute addressing when it's actually helpful for code-size instead of harmful. Syntax: in NASM, put default rel at the top of your file; in AT&T syntax, use foo(%rip); in GAS .intel_syntax noprefix, use [rip + foo].
Disable PIE mode to make 32-bit absolute addressing work
Use gcc -fno-pie -no-pie to override this back to the old behaviour. -no-pie is the linker option, -fno-pie is the code-gen option. With only -fno-pie, gcc will make code like mov eax, offset .LC0 that doesn't link with the still-enabled -pie.
(clang can have PIE enabled by default, too: use clang -fno-pie -nopie. A July 2017 patch made -no-pie an alias for -nopie, for compat with gcc, but clang 4.0.1 doesn't have it.)
Performance cost of PIE for 64-bit (minor) or 32-bit code (major)
With only -no-pie (but still -fpie), compiler-generated code (from C or C++ sources) will be slightly slower and larger than necessary, but will still be linked into a position-dependent executable which won't benefit from ASLR. "Too much PIE is bad for performance" reports an average slowdown of 3% for x86-64 on SPEC CPU2006 (I don't have a copy of the paper so IDK what hardware that was on :/). But in 32-bit code, the average slowdown is 10%, worst-case 25% (on SPEC CPU2006).
The penalty for PIE executables is mostly for stuff like indexing static arrays, as Agner describes in the question, where using a static address as a 32-bit immediate or as part of a [disp32 + index*4] addressing mode saves instructions and registers vs. a RIP-relative LEA to get an address into a register. Also, a 5-byte mov r32, imm32 instead of a 7-byte lea r64, [rel symbol] for getting a static address into a register is nice for passing the address of a string literal or other static data to a function.
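As a rough illustration of the two cases above, here is a minimal C sketch you could paste into a compiler explorer and build once with -fno-pie and once with -fpie to compare the asm. The names squares, square_of and greeting are made up for this example, and the exact instructions you get depend on compiler version and options.

```c
#include <assert.h>
#include <stddef.h>

/* File-scope static data. In a non-PIE executable its address is a
   link-time constant; in a PIE it is only known relative to RIP. */
static const int squares[8] = {0, 1, 4, 9, 16, 25, 36, 49};

/* Hypothetical helper: index the static array.
   Position-dependent x86-64 code can typically fold the array address into
   the load as a 32-bit displacement ([disp32 + index*4]); PIE code
   typically needs a RIP-relative LEA into a register first. */
int square_of(size_t i)
{
    assert(i < sizeof squares / sizeof squares[0]);  /* caller must stay in bounds */
    return squares[i];
}

/* Hypothetical helper: hand out the address of a string literal.
   Position-dependent code can typically use a 5-byte mov r32, imm32;
   PIE code typically uses a 7-byte RIP-relative LEA. */
const char *greeting(void)
{
    return "hello";
}
```

The functions behave identically either way; only the generated addressing modes differ, which is where the code-size and register-pressure cost comes from.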
-fPIE still assumes no symbol-interposition for global variables / functions, unlike -fPIC for shared libraries which have to go through the GOT to access globals (which is yet another reason to use static for any variables that can be limited to file scope instead of global). See The sorry state of dynamic libraries on Linux.
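A minimal C sketch of the static-vs-global point, using hypothetical names (hits, total_hits, record_hit). Compiling it with -fPIC vs. -fPIE and comparing the asm should show the difference described above, assuming a typical x86-64 Linux toolchain.

```c
#include <assert.h>

/* File scope: no other object file or library can define or interpose this
   symbol, so the compiler can always access it directly. */
static int hits;

/* Global: with -fPIC for a shared library, another definition could be
   interposed at runtime, so accesses typically go through the GOT.
   With -fPIE the compiler assumes no interposition. */
int total_hits;

/* Hypothetical helper: bump both counters and return the private one. */
int record_hit(void)
{
    hits++;
    total_hits++;
    assert(hits > 0);   /* sanity check for this sketch only */
    return hits;
}
```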
Thus -fPIE is much less bad than -fPIC for 64-bit code, but still bad for 32-bit because RIP-relative addressing isn't available. See some examples on the Godbolt compiler explorer. On average, -fPIE has a very small performance / code-size downside in 64-bit code. The worst case for a specific loop might only be a few %. But 32-bit PIE can be much worse.
None of these -f code-gen options make any difference when just linking, or when assembling .S hand-written asm. gcc -fno-pie -no-pie -O3 main.c nasm_output.o is a case where you want both options.
Checking your GCC config
If your GCC was configured this way, gcc -v |& grep -o -e '[^ ]*pie' prints --enable-default-pie. Support for this config option was added to gcc in early 2015. Ubuntu enabled it in 16.10, and Debian around the same time in gcc 6.2.0-7 (leading to kernel build errors: https://lkml.org/lkml/2016/10/21/904).
Related: Build compressed x86 kernels as PIE was also affected by the changed default.
Why doesn't Linux randomize the address of the executable code segment? is an older question about why it wasn't the default earlier, or was only enabled for a few packages on older Ubuntu before it was enabled across the board.
Note that ld itself didn't change its default. It still works normally (at least on Arch Linux with binutils 2.28). The change is that gcc defaults to passing -pie as a linker option, unless you explicitly use -static or -no-pie.
In a NASM source file, I used a32 mov eax, [abs buf] to get an absolute address. (I was testing whether the 6-byte way to encode small absolute addresses (address-size prefix + mov eax,moffs: 67 a1 40 f1 60 00) has an LCP stall on Intel CPUs. It does.)
nasm -felf64 -Worphan-labels -g -Fdwarf testloop.asm &&
ld -o testloop testloop.o # works: static executable
gcc -v -nostdlib testloop.o # doesn't work
...
..../collect2 ... -pie ...
/usr/bin/ld: testloop.o: relocation R_X86_64_32 against `.bss' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
gcc -v -no-pie -nostdlib testloop.o # works
gcc -v -static -nostdlib testloop.o # also works: -static implies -no-pie
GCC can also make a "static PIE" with -static-pie: ASLRed, but with no dynamic libraries or ELF interpreter. That's not the same thing as -static -pie; those conflict with each other (you get a static non-PIE), although that might possibly get changed.
Related: building static / dynamic executables with/without libc, defining _start or main.
Checking if an existing executable is PIE or not
This has also been asked at: How to test whether a Linux binary was compiled as position independent code?
file and readelf say that PIEs are "shared objects", not ELF executables. ELF-type EXEC can't be PIE.
$ gcc -fno-pie -no-pie -O3 hello.c
$ file a.out
a.out: ELF 64-bit LSB executable, ...
$ gcc -O3 hello.c
$ file a.out
a.out: ELF 64-bit LSB shared object, ...
## Or with a more recent version of file:
a.out: ELF 64-bit LSB pie executable, ...
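The same check can be sketched in C by looking at the e_type field of the ELF header, which sits right after the 16-byte e_ident array. This is only an illustration: classify_elf and the elf_kind enum are hypothetical names, the caller is assumed to have already read the first bytes of the file into a buffer, and ET_DYN alone can't tell a PIE from an ordinary shared library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum elf_kind { ELF_INVALID, ELF_EXEC, ELF_DYN, ELF_OTHER };

/* Hypothetical helper: classify an ELF image from its first 18 bytes.
   ET_EXEC (2) is a position-dependent executable; ET_DYN (3) is a PIE or a
   shared library. */
enum elf_kind classify_elf(const unsigned char *hdr, size_t len)
{
    unsigned type;

    assert(hdr != NULL);
    if (len < 18 || memcmp(hdr, "\177ELF", 4) != 0)
        return ELF_INVALID;              /* too short, or wrong magic number */

    /* e_ident[5] (EI_DATA): 1 = little-endian, 2 = big-endian */
    if (hdr[5] == 1)
        type = hdr[16] | ((unsigned)hdr[17] << 8);
    else if (hdr[5] == 2)
        type = ((unsigned)hdr[16] << 8) | hdr[17];
    else
        return ELF_INVALID;

    if (type == 2)
        return ELF_EXEC;
    if (type == 3)
        return ELF_DYN;
    return ELF_OTHER;                    /* e.g. relocatable .o or core file */
}
```

To use it on a real file, read at least 18 bytes from the start of the file with fread and pass the buffer in. In practice file or readelf -h is the easier way to get the same answer.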
gcc -static-pie is a special thing that GCC doesn't do by default, even with -nostdlib. It shows up as LSB pie executable, dynamically linked with current versions of file. (See What's the difference between "statically linked" and "not a dynamic executable" from Linux ldd?). It has ELF-type DYN, but readelf shows no .interp, and ldd will tell you it's statically linked. GDB starti and /proc/<pid>/maps confirm that execution starts at the top of its _start, not in an ELF interpreter.
Semi-related (but not really): another recent gcc feature is gcc -fno-plt. Finally, calls into shared libraries can be just call [rip + symbol@GOTPCREL] (AT&T call *puts@GOTPCREL(%rip)), with no PLT trampoline.
The NASM version of this is call [rel puts wrt ..got] as an alternative to call puts wrt ..plt. See Can't call C standard library function on 64-bit Linux from assembly (yasm) code. This works in a PIE or non-PIE, and avoids having the linker build a PLT stub for you.
Some distros have started enabling it. It also avoids needing writeable + executable memory pages, so it's good for security against code-injection. (I think modern PLT implementations don't need that either, just updating a GOT pointer rather than rewriting a jmp rel32 instruction, so there might not be a security difference.)
It's a significant speedup for programs that make a lot of shared-library calls, e.g. x86-64 clang -O2 -g compiling tramp3d goes from 41.6s to 36.8s on whatever hardware the patch author tested on. (clang is maybe a worst-case scenario for shared library calls, making lots of calls to small LLVM library functions.)
It does require early binding instead of lazy dynamic linking, so it's slower for big programs that exit right away (e.g. clang --version or compiling hello.c). This slowdown could be reduced with prelink, apparently.
This doesn't remove the GOT overhead for external variables in shared library PIC code, though. (See the godbolt link above).
Footnote 1:
64-bit absolute addresses actually are allowed in Linux ELF shared objects, with text relocations to allow loading at different addresses (ASLR and shared libraries). This allows you to have jump tables in section .rodata, or static const int *foo = &bar; without a runtime initializer.
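In C terms, both cases look like the sketch below: static data that holds addresses, which the toolchain handles with 64-bit absolute relocations instead of any startup code you have to write. The names bar, foo, ops, apply_op and read_through_foo are made up for illustration, and the function-pointer table is a stand-in for a jump table.

```c
#include <assert.h>

static const int bar = 42;

/* Static pointer initialized with an address: in a PIE or shared object,
   this slot gets a 64-bit absolute relocation that is fixed up at load
   time, so no runtime initializer code is needed. */
static const int *foo = &bar;

static int add_one(int x) { return x + 1; }
static int twice(int x)   { return x * 2; }
static int negate(int x)  { return -x; }

/* A table of code addresses, the C-level analogue of a jump table.
   Each entry is a 64-bit absolute address. */
static int (*const ops[])(int) = { add_one, twice, negate };

/* Hypothetical helper: dispatch through the table. */
int apply_op(unsigned which, int x)
{
    assert(which < sizeof ops / sizeof ops[0]);
    return ops[which](x);
}

/* Hypothetical helper: read the value that foo points at. */
int read_through_foo(void)
{
    return *foo;
}
```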
So mov rdi, qword msg works (NASM/YASM syntax for 10-byte mov r64, imm64, aka AT&T syntax movabs, the only instruction which can use a 64-bit immediate). But that's larger and usually slower than lea rdi, [rel msg], which is what you should use if you decide not to disable -pie. A 64-bit immediate is slower to fetch from the uop cache on Sandybridge-family CPUs, according to Agner Fog's microarch pdf. (Yes, the same person who asked this question. :)
You can use NASM's default rel instead of specifying it in every [rel symbol] addressing mode. See also "Mach-O 64-bit format does not support 32-bit absolute addresses. NASM Accessing Array" for some more description of avoiding 32-bit absolute addressing. OS X can't use 32-bit addresses at all, so RIP-relative addressing is the best way there, too.
In position-dependent code (-no-pie), you should use mov edi, msg when you want an address in a register; 5-byte mov r32, imm32 is even smaller than RIP-relative LEA, and more execution ports can run it.