In general, architectural registers are all equal, and renamed onto a large array of physical registers.
(Except partial registers can be slower, especially high-byte AH/BH/CH/DH, which are slow to read after writing the full register on Haswell and later. See "How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent" and also "Why doesn't GCC use partial registers?" for problems when writing 8-bit and 16-bit registers. The rest of this answer is just going to consider 32/64-bit operand-size.)
But some instructions require specific registers: legacy variable-count shifts (without BMI2 shrx etc.) require the count in CL, and division requires the dividend in EDX:EAX (or RDX:RAX for the slower 64-bit version).
Using a call-preserved register like RBX means your function has to spend extra instructions saving/restoring it.
But of course there are perf differences if you need more instructions. So let's assume all else is equal, and just talk about the uops, latency, and code-size of a single instruction when the only change is which register is used for one of its operands.
TL:DR: the only perf difference is due to instruction-encoding restrictions / differences. Sometimes a different register will allow / require (or get the assembler to pick) a different encoding, which will often be smaller / larger as a special case, and sometimes even executes differently.
Generally smaller code is faster, and packs better in the uop cache and I-cache, so unless you've analyzed a specific case and found a problem, favour the smaller encoding. Often that means keeping a byte value in AL so you can use those special-case instructions, and avoiding RBP / R13 for pointers.
Special cases where a specific encoding is extra slow, not just size
LEA with RBP or R13 as a base can be slower on Intel if the addressing mode didn't already have a +displacement constant.
e.g. lea eax, [rbp + 12] is encodeable as-written, and is just as fast as lea eax, [rcx + 12].
But lea eax, [rbp + rcx*4] can only be encoded in machine code as lea eax, [rbp + rcx*4 + 0] (because of addressing mode escape-code stuff), which is a 3-component LEA, and thus slower on Intel (3 cycle latency on Sandybridge-family instead of 1 cycle, see https://agner.org/optimize/ instruction tables and microarch PDF). On AMD, having a scaled-index would already make it a slow-LEA even with lea eax, [rdx + rcx*4].
Outside of LEA, using RBP / R13 as the base in any addressing mode always requires a disp8/32 byte or dword, but I don't think the actual AGUs are slower for a 3-component addressing mode. So it's just a code-size effect.
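To make the RBP / R13 rule concrete, here is a minimal sketch that counts the bytes following the opcode (ModRM, optional SIB, displacement) for a [base + index*scale + disp] addressing mode. The helper name addr_mode_bytes and the table REGS64 are made up for this illustration. The model deliberately ignores REX prefixes, RIP-relative addressing, and forms with no base register, so treat it as a rough aid rather than a full encoder.

```python
# Toy model of x86-64 ModRM/SIB/displacement sizing for [base + index*scale + disp].
# Hypothetical helper: ignores REX prefixes, RIP-relative and no-base forms.
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
          "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def addr_mode_bytes(base, index=None, disp=0):
    """Bytes used after the opcode: ModRM + optional SIB + displacement."""
    low3 = REGS64.index(base) & 7
    # An index register, or a base whose low 3 bits are 100 (rsp/r12), needs a SIB byte.
    needs_sib = index is not None or low3 == 4
    if disp == 0 and low3 != 5:
        disp_bytes = 0   # mod=00: no displacement at all
    elif -128 <= disp <= 127:
        disp_bytes = 1   # mod=01: disp8 (forced for rbp/r13 even when disp == 0)
    else:
        disp_bytes = 4   # mod=10: disp32
    return 1 + (1 if needs_sib else 0) + disp_bytes


for args in [("rbp", None, 12), ("rcx", None, 12), ("rbp", "rcx", 0), ("rdx", "rcx", 0)]:
    print(args, addr_mode_bytes(*args))
```

With these inputs, [rbp + rcx*4] comes out one byte longer than [rdx + rcx*4] because of the forced disp8 of zero, while [rbp + 12] and [rcx + 12] are the same size.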
Other cases include "Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?", where the short-form 2-byte encoding for adc al, imm8 is 2 uops even on modern uarches like Skylake, where adc bl, imm8 is 1 uop.
So not only does the adc reg,0 special case not work for adc al,0 on Sandybridge through Haswell, Broadwell and newer forgot to (or chose not to) optimize how that encoding decodes to uops. (Of course you could manually encode adc al,0 using the 3-byte Mod/RM encoding, but assemblers will always pick the shortest encoding so adc al,0 will assemble to the short form by default.) This is only a problem with byte registers; adc eax,0 will use the 3-byte opcode ModRM imm8 encoding, not the 5-byte opcode imm32 encoding.
For other cases of op al,imm8, the only difference is code-size, which only indirectly matters for performance (because of decoding, uop-cache packing, and I-cache misses).
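As an illustration of the op al,imm8 short form versus the generic ModRM form, the following sketch encodes the eight classic ALU operations with an 8-bit register and an 8-bit immediate. The function encode_alu_r8_imm8 and its force_modrm flag are hypothetical names for this example, not part of any assembler. High-byte registers (AH/BH/CH/DH) are left out to keep the table simple.

```python
# Sketch: encode `op reg8, imm8` for the classic ALU group (add/or/adc/sbb/and/sub/xor/cmp).
# Hypothetical helper for illustration; AH/CH/DH/BH are intentionally not modelled.
ALU_OPS = {"add": 0, "or": 1, "adc": 2, "sbb": 3, "and": 4, "sub": 5, "xor": 6, "cmp": 7}
REG8 = ["al", "cl", "dl", "bl", "spl", "bpl", "sil", "dil",
        "r8b", "r9b", "r10b", "r11b", "r12b", "r13b", "r14b", "r15b"]


def encode_alu_r8_imm8(op, reg, imm, force_modrm=False):
    n = ALU_OPS[op]
    r = REG8.index(reg)
    imm_byte = imm & 0xFF
    if reg == "al" and not force_modrm:
        # Short form with AL implied by the opcode: no ModRM byte.
        return bytes([0x04 + 8 * n, imm_byte])
    prefix = []
    if r >= 8:
        prefix = [0x41]   # REX.B selects r8b..r15b
    elif r >= 4:
        prefix = [0x40]   # a bare REX is needed to reach spl/bpl/sil/dil
    # Generic form: opcode 0x80, ModRM with mod=11, /n in the reg field, register in r/m.
    return bytes(prefix + [0x80, 0xC0 | (n << 3) | (r & 7), imm_byte])


for reg in ("al", "dl", "r12b"):
    print("add %s, 7:" % reg, encode_alu_r8_imm8("add", reg, 7).hex(" "))
print("adc al, 0 (ModRM form):", encode_alu_r8_imm8("adc", "al", 0, force_modrm=True).hex(" "))
```

Passing force_modrm=True for AL produces the 3-byte encoding that an assembler would not pick by default.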
See "Tips for golfing in x86/x64 machine code" for more about special cases of code-size, like xchg eax, ecx being 1 byte vs. xchg edx, ecx being 2 bytes.
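The xchg size difference can be sketched the same way. The helper encode_xchg_r32 below is a made-up name, covers only the eight legacy 32-bit registers, and picks one of the two valid ModRM operand orders arbitrarily, so the exact second byte may differ from what a given assembler emits.

```python
# Sketch: encode `xchg r32, r32` for the eight legacy 32-bit registers.
# Hypothetical helper; r8d..r15d and other operand sizes are not modelled.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def encode_xchg_r32(a, b):
    ra, rb = REG32.index(a), REG32.index(b)
    if ra == 0 and rb == 0:
        # 0x90 is NOP, so in 64-bit mode xchg eax, eax needs the ModRM form.
        return bytes([0x87, 0xC0])
    if ra == 0:
        return bytes([0x90 + rb])   # 1-byte short form: xchg eax, r32
    if rb == 0:
        return bytes([0x90 + ra])   # operand order does not matter for xchg
    # Generic 2-byte form: opcode 0x87, ModRM mod=11 with reg=b and r/m=a.
    return bytes([0x87, 0xC0 | (rb << 3) | ra])


print(encode_xchg_r32("eax", "ecx").hex(" "))
print(encode_xchg_r32("edx", "ecx").hex(" "))
```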
add rsp, 8 can need an extra stack-sync uop if there hasn't been an explicit use of RSP or ESP since the last push/pop/call/ret (along the path of execution of course, not in the static code layout). (See "What is the stack engine in the Sandybridge microarchitecture?") This is why compilers like clang use a dummy push or pop to reserve / free a single stack slot: "Why does this function push RAX to the stack as the first operation?"
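The stack-sync rule described above can be approximated with a very rough toy model. It assumes only that push/pop/call/ret leave a pending offset in the stack engine, and that the next explicit use of RSP costs one sync uop to apply it. The function count_sync_uops and the trace format are invented for this sketch. Real hardware has further triggers (for example the offset counter overflowing) that are not modelled here.

```python
# Toy model of stack-sync uops, assuming only the rule stated in the text:
# an explicit RSP use after push/pop/call/ret costs one extra sync uop.
IMPLICIT_RSP = {"push", "pop", "call", "ret"}


def count_sync_uops(trace):
    """trace: list of (mnemonic, uses_rsp_explicitly) along the executed path."""
    pending_offset = False   # True while the stack engine holds an unsynced offset
    syncs = 0
    for mnemonic, explicit_rsp in trace:
        if mnemonic in IMPLICIT_RSP:
            pending_offset = True
        elif explicit_rsp and pending_offset:
            syncs += 1
            pending_offset = False
    return syncs


# call + add rsp, 8 + ret: the add needs a sync uop in this model.
print(count_sync_uops([("call", False), ("add", True), ("ret", False)]))
# call + pop (dummy) + ret: only implicit RSP updates, so no sync uop.
print(count_sync_uops([("call", False), ("pop", False), ("ret", False)]))
```

In this model the dummy pop avoids the sync uop that add rsp, 8 would incur, which matches the reason given for clang's choice.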
add al, 7; add dl, 7; add r12b, 7 are 2, 3, 4 bytes respectively, the last due to the REX prefix as you note. This may slow down the time to fetch the instructions, or waste cache space, but I'm not aware that it makes any difference in the time to actually execute the instructions. – Nate Eldredge
rbx is "slower" than rcx simply because you would have to save and restore it at the start and end of your function. Or it can go the other way; if you call many other functions, rbx may end up being faster because you don't have to save and restore it around every function call you make. But that's nothing to do with the machine itself, of course. – Nate Eldredge