This question was put on hold as too broad, presumably because of the research I included in an effort to "show my work" instead of asking a low effort question. To remedy this, allow me to summarize the entire question in a single sentence (credit to @PeterCordes for this phrase):
How do I efficiently call (x86-64) ahead-of-time compiled functions (that I control, may be further than 2GB away) from JITed code (that I am generating)?
This alone, I suspect, would be put on hold as "too broad." In particular, it lacks a "what have you tried." So, I felt the need to add additional information showing my research/thinking and what I have tried. Below is a somewhat stream of consciousness of this.
Note that none of the questions posed below here are ones I expect to be answered; they are more rhetorical. Their purpose is to demonstrate why I can't answer the above question (despite my research, I lack the experience in this area to make definitive statements such as @PeterCordes's "branch prediction hides the latency of fetching and checking the function pointer from memory, assuming that it predicts well."). Also note that the Rust component is largely irrelevant here as this is an assembly issue. My reasoning for including it was the ahead-of-time compiled functions are written in Rust, so I was unsure if there was something that Rust did (or instructed LLVM to do) that could be advantageous in this situation. It is totally acceptable for an answer to not consider Rust at all; in fact, I expect this will be the case.
Think of the following as scratch work on the back of a math exam:
Note: I muddled the term intrinsics here. As pointed out in the comments, "ahead-of-time compiled functions" is a better description. Below I'll abbreviate that AOTC functions.
I'm writing a JIT in Rust (although Rust is only relevant to a bit of my question, the bulk of it relates to JIT conventions). I have AOTC functions that I've implemented in Rust that I need to be able to call
from code emitted by my JIT. My JIT mmap(_, _, PROT_EXEC, MAP_ANONYMOUS | MAP_SHARED)
s some pages for the jitted code. I have the addresses of my AOTC functions, but unfortunately they are much further away than a 32-bit offset. I'm trying to decide now how to emit calls to these AOTC functions. I've considered the following options (these are not questions to be answered, just demonstrating why I can't answer the core question of this SO thread myself):
(Rust specific) Somehow make Rust place the AOTC functions close to (maybe on?) the heap so that the
call
s will be within a 32-bit offset. It's unclear that that is possible with Rust (There is a way to specify custom linker args, but I can't tell to what those are applied and if I could target a single function for relocation. And even if I could where do I put it?). It also seems like this could fail if the heap is large enough.(Rust specific) Allocate my JIT pages closer to the AOTC functions. This could be achieved with
mmap(_, _, PROT_EXEC, MAP_FIXED)
, but I'm unsure how to pick an address that wouldn't clobbering existing Rust code (and keeping within arch restrictions--is there a sane way to get those restrictions?).Create stubs within the JIT pages that handle the absolute jump (code below), then
call
the stubs. This has the benefit of the (initial) call site in the JITted code being a nice small relative call. But it feels wrong to have to jump through something. This seems like it would be detrimental to performance (perhaps interfering with RAS/jump address prediction). Additionally, it seems like this jump would be slower since its address is indirect and it depends on themov
for that address.
mov rax, {ABSOLUTE_AOTC_FUNCTION_ADDRESS}
jmp rax
The reverse of (3), just inlining the above at each intrinsic call site in the JITed code. This resolves the indirection issue, but makes the JITted code larger (perhaps this has instruction cache and decoding consequences). It still has the issue that the jump is indirect and depends on the
mov
.Place the addresses of the AOTC functions on a PROT_READ (only) page near the JIT pages. Make all the call sites near, absolute indirect calls (code below). This removes the second level of indirection from (2). But the encoding of this instruction is unfortunately large (6 bytes), so it has the same issues as (4). Additionally, now instead of depending on a register, jumps unnecessarily (insofar as the address is known at JIT time) depend on memory, which certainly has performance implications (despite perhaps this page being cached?).
aotc_function_address:
.quad 0xDEADBEEF
# Then at the call site
call qword ptr [rip+aotc_function_address]
Futz with a segment register to place it closer to the AOTC functions so that calls can be made relative to that segment register. The encoding of such a call is long (so maybe this has decoding pipeline issues), but other than that this largely avoids lots of the tricky bits of everything before it. But, maybe calling relative to a non-(as pointed out by @prl, this doesn't work without a far call, which is terrible for performance)cs
segment performs poorly. Or maybe such futzing is not wise (messes with the Rust runtime, for example).Not really a solution, but I could make the compiler 32-bit and not have this problem at all. That's not really a great solution and it also would prevent me from using the extended general purpose registers (of which I utilize all).
All of the options presented have drawbacks. Briefly, 1 and 2 are the only ones that don't seem to have performance impacts, but it's unclear if there is a non-hacky way to achieve them (or any way at all for that matter). 3-5 are independent of Rust, but have obvious performance drawbacks.
Given this stream of consciousness, I arrived at the following rhetorical question (which don't need explicit answers) to demonstrate that I lack the knowledge to answer the core question of this SO thread by myself. I have struck them to make it abundantly clear that I am not posing all of these are part of my question.
For approach (1), is it possible to force Rust to link certain
extern "C"
functions at a specific address (near the heap)? How should I choose such an address (at compile time)? Is it safe to assume that any address returned bymmap
(or allocated by Rust) will be within a 32 bit offset of this location?For approach (2), how can I find a suitable place to place the JIT pages (such that it doesn't clobber existing Rust code)?
And some JIT (non-Rust) specific questions:
For approach (3), will the stubs hamper performance enough that I should care? What about the indirect
jmp
? I know this somewhat resembles linker stubs, except as I understand linker stubs are at least only resolved once (so they don't need to be indirect?). Do any JITs employ this technique?For approach (4), if the indirect call in 3 is okay, is inlining the calls worth it? If JITs typically employ approach (3/4) is this option better?
For approach (5), is the dependence of the jump on memory (given that the address is known at compile time) bad? Would that make it less performant that (3) or (4)? Do any JITs employ this technique?
For approach (6), is such futzing unwise? (Rust specific) Is there a segment register available (not used by the runtime or ABI) for this purpose? Will calls relative to a non-
cs
segment be as performant as those relative tocs
?And finally (and most importantly), is there a better approach (perhaps employed more commonly by JITs) that I'm missing here?
I can't implement (1) or (2) without my Rust questions having answers. I could, of course, implement and benchmark 3-5 (perhaps 6, although it would be nice to know about the segment register futzing beforehand), but given that these are vastly different approaches, I was hoping there was existing literature about this that I couldn't find, because I didn't know the right terms to google for (I'm also currently working on those benchmarks). Alternatively maybe someone who's delved into JIT internals can share their experience or what they've commonly seen?
I am aware of this question: Jumps for a JIT (x86_64). It differs from mine because it is talking about stringing together basic blocks (and the accepted solution is way too many instructions for a frequently called intrinsic). I am also aware of Call an absolute pointer in x86 machine code, which while it discusses similar topics to mine, is different, because I am not assuming that absolute jumps are necessary (approaches 1-2 would avoid them, for example).