I've seen this `r10` weirdness a few times, so let's see if anyone knows what's up.
Take this simple function:
    #include <stdint.h>

    #define SZ 4

    void sink(uint64_t *p);

    void andpop(const uint64_t* a) {
        uint64_t result[SZ];
        for (unsigned i = 0; i < SZ; i++) {
            result[i] = a[i] + 1;
        }
        sink(result);
    }
It just adds 1 to each of the 4 64-bit elements of the passed-in array, stores the results in a local array, and calls `sink()` on that array (to avoid the whole function being optimized away).
Here's the corresponding assembly:
    andpop(unsigned long const*):
            lea     r10, [rsp+8]
            and     rsp, -32
            push    QWORD PTR [r10-8]
            push    rbp
            mov     rbp, rsp
            push    r10
            sub     rsp, 40
            vmovdqa ymm0, YMMWORD PTR .LC0[rip]
            vpaddq  ymm0, ymm0, YMMWORD PTR [rdi]
            lea     rdi, [rbp-48]
            vmovdqa YMMWORD PTR [rbp-48], ymm0
            vzeroupper
            call    sink(unsigned long*)
            add     rsp, 40
            pop     r10
            pop     rbp
            lea     rsp, [r10-8]
            ret
It's hard to understand almost everything that is going on with `r10`. First, `r10` is set to point to `rsp + 8`, then comes `push QWORD PTR [r10-8]`, which as far as I can tell pushes a copy of the return address onto the stack. Following that, `rbp` is set up as normal, and finally `r10` itself is pushed.
To unwind all this, `r10` is popped off the stack and used to restore `rsp` to its original value.
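To make the bookkeeping easier to follow, here is a minimal sketch in C that replays only the `rsp` arithmetic of the listing above. It does not touch any real stack memory, and the names `frame_trace` and `trace_r10_frame` are made up for this illustration. It assumes the popped `r10` equals the pushed one, which holds as long as nothing overwrites that stack slot.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record of the addresses of interest in the gcc listing. */
typedef struct {
    uint64_t result_addr;  /* rbp-48, where the ymm store lands */
    uint64_t rsp_at_call;  /* rsp when `call sink` executes */
    uint64_t rsp_at_ret;   /* rsp when `ret` executes */
} frame_trace;

/* Replays the stack-pointer arithmetic only; no memory is read or written. */
frame_trace trace_r10_frame(uint64_t rsp_at_entry) {
    uint64_t rsp = rsp_at_entry, rbp, r10;
    frame_trace t;

    r10 = rsp + 8;            /* lea  r10, [rsp+8] */
    rsp &= ~(uint64_t)31;     /* and  rsp, -32 */
    rsp -= 8;                 /* push QWORD PTR [r10-8] */
    rsp -= 8;                 /* push rbp */
    rbp = rsp;                /* mov  rbp, rsp */
    rsp -= 8;                 /* push r10 */
    rsp -= 40;                /* sub  rsp, 40 */

    t.result_addr = rbp - 48; /* lea  rdi, [rbp-48] */
    t.rsp_at_call = rsp;

    rsp += 40;                /* add  rsp, 40 */
    rsp += 8;                 /* pop  r10 (assumed unchanged) */
    rsp += 8;                 /* pop  rbp */
    rsp = r10 - 8;            /* lea  rsp, [r10-8] */
    assert(rsp == rsp_at_entry);
    t.rsp_at_ret = rsp;
    return t;
}
```

Calling `trace_r10_frame` with any entry value that is 8 modulo 16 (the usual state right after a `call` under the System V x86-64 ABI) gives a `result_addr` that is a multiple of 32 and an `rsp_at_ret` equal to the entry value. In this model the final `lea` alone determines `rsp` at `ret`; the slot written by `push QWORD PTR [r10-8]` is never read back.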
Some observations:
- Looking at the entire function, all of this seems like a totally roundabout way of simply restoring `rsp` to its original value before `ret`, but the usual epilog of `mov rsp, rbp` would do just as well (see clang)!
- That said, the (expensive) `push QWORD PTR [r10-8]` doesn't even help in that mission: this value (the return address?) is apparently never used.
- Why is `r10` pushed and popped at all? The value isn't clobbered in the very small function body and there is no register pressure.
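For comparison, the first observation can be sketched the same way. The sequence below is an assumption about what an `rbp`-first frame looks like (save `rbp`, copy `rsp` into it, then align), not a copy of any particular compiler's output, and `rbp_first_roundtrip` is a hypothetical name. It models only the `rsp` arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical rbp-first frame: rbp remembers the unaligned rsp,
 * so no extra register is needed to undo the alignment. */
uint64_t rbp_first_roundtrip(uint64_t rsp_at_entry) {
    uint64_t rsp = rsp_at_entry, rbp;

    rsp -= 8;               /* push rbp */
    rbp = rsp;              /* mov  rbp, rsp */
    rsp &= ~(uint64_t)31;   /* and  rsp, -32 */
    rsp -= 32;              /* sub  rsp, 32: room for the 32-byte local */
    assert(rsp % 32 == 0);  /* the local at [rsp] is 32-byte aligned */

    rsp = rbp;              /* mov  rsp, rbp */
    rsp += 8;               /* pop  rbp */
    return rsp;             /* value of rsp when ret executes */
}
```

Under these assumptions the function returns its argument unchanged for any entry value, which is the sense in which a plain `mov rsp, rbp` epilog can undo the `and rsp, -32` without help from `r10`.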
What's up with that? I've seen it several times before, and it usually wants to use `r10`, sometimes `r13`. It seems likely that it has something to do with aligning the stack to 32 bytes, since if you change `SZ` to be less than 4 it uses `xmm` ops and the issue disappears.
Here's `SZ == 2` for example:
    andpop(unsigned long const*):
            sub     rsp, 24
            vmovdqa xmm0, XMMWORD PTR .LC0[rip]
            vpaddq  xmm0, xmm0, XMMWORD PTR [rdi]
            mov     rdi, rsp
            vmovaps XMMWORD PTR [rsp], xmm0
            call    sink(unsigned long*)
            add     rsp, 24
            ret
Much nicer!
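The 32-byte alignment hunch can be checked with a little arithmetic. The sketch below assumes the System V x86-64 convention that `rsp + 8` is a multiple of 16 at function entry; `align_down` and `slack_32` are hypothetical helper names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to a multiple of align (align must be a power of two).
 * With align == 32 this is what `and rsp, -32` computes. */
uint64_t align_down(uint64_t addr, uint64_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}

/* Number of bytes `and rsp, -32` skips for a given entry rsp. */
uint64_t slack_32(uint64_t rsp_at_entry) {
    return rsp_at_entry - align_down(rsp_at_entry, 32);
}
```

Under that assumption an entry `rsp` is either 8 or 24 modulo 32, so `slack_32` yields 8 for some callers and 24 for others, and no single constant `sub` can produce a 32-byte-aligned slot. For 16-byte alignment the constant `sub rsp, 24` in the `SZ == 2` listing is always enough, because an address that is 8 modulo 16, minus 24, is a multiple of 16.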
... `rsp` onto the stack, and smashing the stack would clobber this value, such that the stack wouldn't be restored correctly and a maliciously overwritten return value wouldn't work - but (a) gcc can easily prove there is no smashing here and (b) it just changes the values you have to write but doesn't prevent any attack. – BeeOnRope

`sink` does something much worse than just prevent the function from being optimized away, it forces `result` to be in memory. – harold

... `sink()` and just does `return result[0]`. No more writes to the stack at all but the same weirdness with `r10`! – BeeOnRope