In general not quite; the programming model is not always exactly equivalent. You need to check both sets of docs if you want to be 100% sure.
https://en.wikipedia.org/wiki/X86-64#Differences_between_AMD64_and_Intel_64
e.g. bsf/bsr: Intel docs say they leave the destination undefined on a zero input, AMD says they leave it unmodified. But in practice Intel does that too, with a microarchitectural dependency on the output register to go with it. This false dependency also infected lzcnt/tzcnt until Skylake, and still affects popcnt, on Intel but not AMD. But until Intel gets around to putting it on paper that they're going to keep making their HW behave this way, compilers won't take advantage of it, and maybe we shouldn't by hand either.
(Wikipedia seems to be saying that on Intel, the upper 32 bits of the destination might be undefined, not zeroed, for bsr/bsf eax, ecx, though. So it's not strictly like always writing EAX. I can confirm this on SKL i7-6700k: mov rax,-1 ; bsf eax, ecx (with zeroed ECX) leaves RAX=-1 (64-bit), not truncated to 2^32-1. But with non-zero ECX, writing EAX has the usual effect of zero-extending into RAX.)
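So portable code shouldn't rely on either vendor's zero-input behaviour. A minimal C sketch of the defensive pattern (the helper name is made up for illustration; __builtin_ctz is a GCC/Clang builtin that is itself undefined for a zero input, hence the explicit branch):

```c
#include <stdint.h>

// Don't rely on BSF/TZCNT leaving the destination unmodified (AMD) or on any
// particular "undefined" result (Intel) when the input is zero: branch on it.
static inline unsigned tzcnt_or(uint32_t x, unsigned if_zero)
{
    return x ? (unsigned)__builtin_ctz(x) : if_zero;
}
```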
This is especially important for kernel code, where privileged instruction behaviour may have more subtle differences. I think TLB invalidation semantics mostly match, e.g. it's guaranteed on both that you don't need to invalidate the TLB after changing an invalid entry to a valid one. Thus x86 disallows "negative caching", so an implementation that wanted to do so would have to snoop page-table stores for coherency.
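As a sketch of what that guarantee means in practice (pte_write is a hypothetical stand-in for whatever writes a page-table entry; the inline asm wraps the real instruction and only runs in ring 0):

```c
// Flush one page's translation from the TLB.
static inline void invlpg(void *va)
{
    asm volatile("invlpg (%0)" : : "r"(va) : "memory");
}

// Invalid -> valid: no invalidation needed on Intel or AMD, because neither
// may cache the "not present" result (no "negative caching"):
//     pte_write(pte, valid_entry);
//
// Valid -> different mapping (or invalid): a stale translation may still be
// cached, so the entry must be flushed:
//     pte_write(pte, new_entry);
//     invlpg(va);
```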
Some of this is probably unintentional, like Intel and AMD having different bugs for sysret with non-canonical x86-64 addresses, making it unsafe to use when a ptrace system call could have modified the saved RIP. A #GP fault can be raised in kernel mode after switching to the user stack, handing control of the kernel to another user-space thread from the same process, which can modify that stack memory. (https://blog.xenproject.org/2012/06/13/the-intel-sysret-privilege-escalation/) That's why Linux always uses iret, except for the common-case fast path where the saved registers are known clean. The comments in entry_64.S in the kernel source summarize this a bit.
Atomicity guarantees for unaligned cached loads/stores are weaker on AMD: boundaries as small as 8 bytes can matter on x86-64, because of AMD. Why is integer assignment on a naturally aligned variable atomic on x86? covers the common subset of that.
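For example (C11; the packed struct is only there to force a member that straddles an 8-byte boundary):

```c
#include <stdatomic.h>
#include <stdint.h>

// Naturally aligned 8-byte object: a plain 64-bit load/store is a single
// atomic access on both vendors (_Atomic also guarantees the alignment).
_Atomic uint64_t ok_counter;

// An access that straddles an 8-byte boundary is where AMD's guarantees are
// weaker than Intel's, even within one cache line; don't expect lock-free
// atomicity for something like this misaligned member.
struct __attribute__((packed)) risky {
    char pad;
    uint64_t x;   // starts at offset 1, crosses an 8-byte boundary
};
```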
Cache line size has never been officially standardized. In practice Intel and AMD CPUs use 64-byte lines, and this can be queried at runtime using CPUID the same way on both.
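A sketch of the usual query, using GCC/Clang's cpuid.h: CPUID leaf 1 reports the CLFLUSH line size in EBX bits 15:8, in units of 8 bytes, on both Intel and AMD, and in practice that matches the cache line size.

```c
#include <cpuid.h>

// Returns the CLFLUSH line size in bytes (64 on current Intel and AMD CPUs),
// or 0 if CPUID leaf 1 isn't available.
static unsigned cache_line_size(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return ((ebx >> 8) & 0xff) * 8;   // EBX[15:8] = line size / 8
}
```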
AFAIK, memory-order rules are identical for WB at least, and probably for other types including WC, and for the interaction with LFENCE/SFENCE/MFENCE vs. lock add. Although it's not clearly documented by Intel whether lock and xchg are intended to be different from mfence. But you're asking about the programming model itself, not just what the docs say on paper. See Does lock xchg have the same behavior as mfence? and What is the difference in logic and performance between LOCK XCHG and MOV+MFENCE?
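In C11 terms, both show up as ways to get a full barrier; a sketch of what current GCC/Clang typically emit on x86-64 (the exact instruction choice varies by compiler and version):

```c
#include <stdatomic.h>

_Atomic int flag;

void publish_with_fence(void)
{
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  // usually MFENCE (or a dummy locked op)
}

void publish_with_xchg(void)
{
    // A seq_cst store is typically compiled to XCHG, i.e. a lock'd RMW,
    // relying on it being a full barrier like MFENCE.
    atomic_store_explicit(&flag, 1, memory_order_seq_cst);
}
```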
IDK about AMD, but NT loads from WC memory might reorder with lock add / xchg on Intel (though they're not supposed to with MFENCE, I think, and that's why an Intel ucode update had to strengthen MFENCE on Skylake to also block OoO exec like LFENCE's other effect, to prevent later loads from being in the pipeline at all). @Bee's answer on the first link mentions this, and see the bottom of this. When testing real hardware, it's always hard to tell what's future-guaranteed behaviour and what's merely an implementation detail, and that's where the manuals come in.
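As a sketch of that ordering concern, for a buffer assumed to be mapped WC (e.g. video RAM), using the SSE4.1 MOVNTDQA intrinsic:

```c
#include <immintrin.h>

// wc_src is assumed to point into a WC-mapped buffer.  The NT load is weakly
// ordered; per the above, relying on a later lock'd instruction to order it
// is questionable on Intel, so use MFENCE when the ordering matters.
__m128i read_wc_then_sync(const void *wc_src)
{
    __m128i v = _mm_stream_load_si128((__m128i *)wc_src);  // MOVNTDQA
    _mm_mfence();  // order the NT load before subsequent loads/stores
    return v;
}
```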