16
votes

I've read in different places that it is done for "performance reasons", but I still wonder what are the particular cases where performance get improved by this 16-byte alignment. Or, in any case, what were the reasons why this was chosen.

edit: I'm thinking I wrote the question in a misleading way. I wasn't asking about why the processor does things faster with 16-byte aligned memory, this is explained everywhere in the docs. What I wanted to know instead, is how the enforced 16-byte alignment is better than just letting the programmers align the stack themselves when needed. I'm asking this because from my experience with assembly, the stack enforcement has two problems: it is only useful by less 1% percent of the code that is executed (so in the other 99% is actually overhead); and it is also a very common source of bugs. So I wonder how it really pays off in the end. While I'm still in doubt about this, I'm accepting peter's answer as it contains the most detailed answer to my original question.

1
Because some SSE instructions require such alignment and in general it costs less to maintain the alignment than to re-align everywhere you want to use those. Also even when not required, you do not want to accidentally cross cache lines.Jester
This is related to the memory controller and CPU cashes. Memory controller in most cases can not read a single byte, it rather reads something about 64 kib bundle. So if stack is aligned to the memory controller bucket/CPU cache-line it improves performance dramatically.Victor Gubin
It makes it possible for the local variables to be easily aligned too because you have a known aligned reference point.Jester
Read [this](software.intel.com/en-us/forums/intel-iamd64 abi versionssa-extensions/topic/752392). Secondly, the alignment is either 16, 32 or 64B (in the 1.0rc version of the abi).Margaret Bloom
It would be cool to have some real numbers on perf and code-size of a different ABI design decision, though. It's possible that only 8-byte alignment would be better, now that vectorized code often wants 32 or 64-byte alignment, so 16-byte alignment isn't sufficient.Peter Cordes

1 Answers

20
votes

Note that the current version of the i386 System V ABI used on Linux also requires 16-byte stack alignment1. See https://sourceforge.net/p/fbc/bugs/659/ for some history, and my comment on https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838#c91 for an attempt at summarizing the unfortunate history of how i386 GNU/Linux + GCC accidentally got into a situation where a backwards-incompat change to the i386 System V ABI was the lesser of two evils.

Windows x64 also requires 16-byte stack alignment before a call, presumably for similar motivations as x86-64 System V.

Also, semi-related: x86-64 System V requires that global arrays of 16 bytes and large be aligned by 16. Same for local arrays of >= 16 bytes or variable size, although that detail is only relevant across functions if you know that you're being passed the address of the start of an array, not a pointer into the middle. (Different memory alignment for different buffer sizes). It doesn't let you make any extra assumptions about an arbitrary int *.


SSE2 is baseline for x86-64, and making the ABI efficient for types like __m128, and for compiler auto-vectorization, was one of the design goals, I think. The ABI has to define how such args are passed as function args, or by reference.

16-byte alignment is sometimes useful for local variables on the stack (especially arrays), and guaranteeing 16-byte alignment means compilers can get it for free whenever it's useful, even if the source doesn't explicitly request it.

If the stack alignment relative to a 16-byte boundary wasn't known, every function that wanted an aligned local would need an and rsp, -16, and extra instructions to save/restore rsp after an unknown offset to rsp (either 0 or -8). e.g. using up rbp for a frame pointer.

Without AVX, memory source operands have to be 16-byte aligned. e.g. paddd xmm0, [rsp+rdi] faults if the memory operand is misaligned. So if alignment isn't known, you'd have to either use movups xmm1, [rsp+rdi] / paddd xmm0, xmm1, or write a loop prologue / epilogue to handle the misaligned elements. For local arrays that the compiler wants to auto-vectorize over, it can simply choose to align them by 16.

Also note that early x86 CPUs (before Nehalem / Bulldozer) had a movups instruction that's slower than movaps even when the pointer does turn out to be aligned. (i.e. unaligned loads/stores on aligned data was extra slow, as well as preventing folding loads into an ALU instruction). (See Agner Fog's optimization guides, microarch guide, and instruction tables for more about all of the above.)

These factors are why a guarantee is more useful than just "usually" keeping the stack aligned. Being allowed to make code which actually faults on a misaligned stack allows more optimization opportunities.

Aligned arrays also speed up vectorized memcpy / strcmp / whatever functions that can't assume alignment, but instead check for it and can jump straight to their whole-vector loops.

From a recent version of the x86-64 System V ABI (r252):

An array uses the same alignment as its elements, except that a local or global array variable of length at least 16 bytes or a C99 variable-length array variable always has alignment of at least 16 bytes.4

4 The alignment requirement allows the use of SSE instructions when operating on the array. The compiler cannot in general calculate the size of a variable-length array (VLA), but it is expected that most VLAs will require at least 16 bytes, so it is logical to mandate that VLAs have at least a 16-byte alignment.

This is a bit aggressive, and mostly only helps when functions that auto-vectorize can be inlined, but usually there are other locals the compiler can stuff into any gaps so it doesn't waste stack space. And doesn't waste instructions as long as there's a known stack alignment. (Obviously the ABI designers could have left this out if they'd decided not to require 16-byte stack alignment.)


Spill/reload of __m128

Of course, it makes it free to do alignas(16) char buf[1024]; or other cases where the source requests 16-byte alignment.

And there are also __m128 / __m128d / __m128i locals. The compiler may not be able to keep all vector locals in registers (e.g. spilled across a function call, or not enough registers), so it needs to be able to spill/reload them with movaps, or as a memory source operand for ALU instructions, for efficiency reasons discussed above.

Loads/stores that actually are split across a cache-line boundary (64 bytes) have significant latency penalties, and also minor throughput penalties on modern CPUs. The load needs data from 2 separate cache lines, so it takes two accesses to the cache. (And potentially 2 cache misses, but that's rare for stack memory).

I think movups already had that cost baked in for vectors on older CPUs where it's expensive, but it still sucks. Spanning a 4k page boundary is much worse (on CPUs before Skylake), with a load or store taking ~100 cycles if it touches bytes on both sides of a 4k boundary. (Also needs 2 TLB checks). Natural alignment makes splits across any wider boundary impossible, so 16-byte alignment was sufficient for everything you can do with SSE2.


max_align_t has 16-byte alignment in the x86-64 System V ABI, because of long double (10-byte/80-bit x87). It's defined as padded to 16 bytes for some weird reason, unlike in 32-bit code where sizeof(long double) == 10. x87 10-byte load/store is quite slow anyway (like 1/3rd the load throughput of double or float on Core2, 1/6th on P4, or 1/8th on K8), but maybe cache-line and page split penalties were so bad on older CPUs that they decided to define it that way. I think on modern CPUs (maybe even Core2) looping over an array of long double would be no slower with packed 10-byte, because the fld m80 would be a bigger bottleneck than a cache-line split every ~6.4 elements.

Actually, the ABI was defined before silicon was available to benchmark on (back in ~2000), but those K8 numbers are the same as K7 (32-bit / 64-bit mode is irrelevant here). Making long double 16-byte does make it possible to copy a single one with movaps, even though you can't do anything with it in XMM registers. (Except manipulate the sign bit with xorps / andps / orps)

Related: this max_align_t definition means that malloc always returns 16-byte aligned memory in x86-64 code. This lets you get away with using it for SSE aligned loads like _mm_load_ps, but such code can break when compiled for 32-bit where alignof(max_align_t) is only 8. (Use aligned_alloc or whatever).


Other ABI factors include passing __m128 values on the stack (after xmm0-7 have the first 8 float / vector args). It makes sense to require 16-byte alignment for vectors in memory, so they can be used efficiently by the callee, and stored efficiently by the caller. Maintaining 16-byte stack alignment at all times makes it easy for functions that need to align some arg-passing space by 16.

There are types like __m128 that the ABI guarantees have 16-byte alignment. If you define a local and take its address, and pass that pointer to some other function, that local needs to be sufficiently aligned. So maintaining 16-byte stack alignment goes hand in hand with giving some types 16-byte alignment, which is obviously a good idea.

These days, it's nice that atomic<struct_of_16_bytes> can cheaply get 16-byte alignment, so lock cmpxchg16b doesn't ever cross a cache line boundary. For the really rare case where you have an atomic local with automatic storage, and you pass pointers to it to multiple threads...


Footnote 1: 32-bit Linux

Not all 32-bit platforms broke backwards compatibility with existing binaries and hand-written asm the way Linux did; some like i386 NetBSD still only use the historical 4-byte stack alignment requirement from the original version of the i386 SysV ABI.

The historical 4-byte stack alignment was also insufficient for efficient 8-byte double on modern CPUs. Unaligned fld / fstp are generally efficient except when they cross a cache-line boundary (like other loads/stores), so it's not horrible, but naturally-aligned is nice.

Even before 16-byte alignment was officially part of the ABI, GCC used to enable -mpreferred-stack-boundary=4 (2^4 = 16-bytes) on 32-bit. This currently assumes the incoming stack alignment is 16 bytes (even for cases that will fault if it's not), as well as preserving that alignment. I'm not sure if historical gcc versions used to try to preserve stack alignment without depending on it for correctness of SSE code-gen or alignas(16) objects.

ffmpeg is one well-known example that depends on the compiler to give it stack alignment: what is "stack alignment"?, e.g. on 32-bit Windows.

Modern gcc still emits code at the top of main to align the stack by 16 (even on Linux where the ABI guarantees that the kernel starts the process with an aligned stack), but not at the top of any other function. You could use -mincoming-stack-boundary to tell gcc how aligned it should assume the stack is when generating code.

Ancient gcc4.1 didn't seem to really respect __attribute__((aligned(16))) or 32 for automatic storage, i.e. it doesn't bother aligning the stack any extra in this example on Godbolt, so old gcc has kind of a checkered past when it comes to stack alignment. I think the change of the official Linux ABI to 16-byte alignment happened as a de-facto change first, not a well-planned change. I haven't turned up anything official on when the change happened, but somewhere between 2005 and 2010 I think, after x86-64 became popular and the x86-64 System V ABI's 16-byte stack alignment proved useful.

At first it was a change to GCC's code-gen to use more alignment than the ABI required (i.e. using a stricter ABI for gcc-compiled code), but later it was written in to the version of the i386 System V ABI maintained at https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI (which is official for Linux at least).


@MichaelPetch and @ThomasJager report that gcc4.5 may have been the first version to have -mpreferred-stack-boundary=4 for 32-bit as well as 64-bit. gcc4.1.2 and gcc4.4.7 on Godbolt appear to behave that way, so maybe the change was backported, or Matt Godbolt configured old gcc with a more modern config.