Stacks are not, and never can be, unlimited in their space for growth. Like everything else, they live in the process's virtual address space, and the amount by which they can grow is always limited by the distance to the adjacent mapped memory region.
When people talk about the stack growing dynamically, what they might mean is one of two things:
- Pages of the stack might be copy-on-write zero pages, which do not get private copies made until the first write is performed.
- Lower parts of the stack region may not yet be reserved (and thus not count towards the process's commit charge, i.e. the amount of physical memory/swap the kernel has accounted for as reserved for the process) until a guard page is hit, in which case the kernel commits more and moves the guard page, or kills the process if there is no memory left to commit.
Trying to rely on the MAP_GROWSDOWN
flag is unreliable and dangerous because it cannot protect you against mmap
creating a new mapping just adjacent to your stack, which will then get clobbered. (See http://lwn.net/Articles/294001/) For the main thread, the kernel automatically reserves the stack-size ulimit
worth of address space (not memory) below the stack and prevents mmap
from allocating it. (But beware! Some broken vendor-patched kernels disable this behavior leading to random memory corruption!) For other threads, you simply must mmap
the entire range of address space the thread might need for stack when creating it. There is no other way. You could make most of it initially non-writable/non-readable, and change that on faults, but then you'd need signal handlers and this solution is not acceptable in a POSIX threads implementation because it would interfere with the application's signal handlers. (Note that, as an extension, the kernel could offer special MAP_
flags to deliver a different signal instead of SIGSEGV
on illegal access to the mapping, and then the threads implementation could catch and act on this signal. But Linux at present has no such feature.)
Finally, note that the clone
syscall does not take a stack pointer argument because it does not need it. The syscall must be performed from assembly code, because the userspace wrapper is required to change the stack pointer in the "child" thread to point to the desired stack, and avoid writing anything to the parent's stack.
Actually, clone
does take a stack pointer argument, because it's unsafe to wait to change stack pointer in the "child" after returning to userspace. Unless signals are all blocked, a signal handler could run immediately on the wrong stack, and on some architectures the stack pointer must be valid and point to an area safe to write at all times.
Not only is modifying the stack pointer impossible from C, but you also couldn't avoid the possibility that the compiler would clobber the parent's stack after the syscall but before the stack pointer was changed.